INTRODUCTION


This issue of the Review devoted to educational and psychological
testing follows the general organization of the most recent issue on this 
topic (Volume XXIII, No. 1) with the exception that no separate chapter 
is given to the testing of educational achievement outside the schools and 
that a separate chapter has been given to the development of statistical 
methods especially useful in test construction and evaluation. No effort is 
made in this introduction to summarize the various chapters. More ade- 
quate understanding of the trends and accomplishments of the past three 
years can be gained by reading the chapters themselves. 

During the preparation of this issue of the Review the chairman wrote 
to over 125 members of the American Educational Research Association, 
presumed from their listing in the List of Members in the December 1954 
issue of the Review to be in positions associated with educational and 
psychological testing. A request was made for copies of, or references to, 
publications representing research studies on testing or involving the use 
of test data not easily discovered in the usual sources. A great many ref- 
erences and publications were received, evaluated carefully, and sorted with 
respect to relevance to the various chapters of this issue of the Review. 
The materials relevant to chapters other than the first were sent to the 
respective chapter authors. We are grateful to the members of the AERA 
who so generously cooperated. 

It is hoped that, because of limitations of space and the necessity to be 
selective, we have not failed too often to include studies which deserve 
recognition. It had been planned to include a chapter on tests of general 
mental ability, but, because of unusual, heavy responsibilities, the author 
of this chapter was unable to complete it in time for it to be included. 


MAX D. ENGELHART, Chairman
Committee on Educational and Psychological Testing 






















CHAPTER I 


Testing and Use of Test Results 


MAX D. ENGELHART 


The realm of educational and psychological testing ranges from the fron-
tier of test theory thru test production to the more settled regions of the
use of tests in the measurement and evaluation of characteristics and abili- 
ties of children and adults. Within the space of this chapter it is impossible 
to do more than cite typical developments and call attention to important 
sources of information. 


Developments Contributing to the Improvement 
of Tests and Testing 


The Technical Recommendations for Achievement Tests (4), prepared by 
committees on test standards of the American Educational Research Asso- 
ciation and the National Council on Measurements Used in Education, 
and the Technical Recommendations for Psychological Tests and Diag- 
nostic Techniques (5), prepared by a joint committee on test standards 
of the American Psychological Association, the AERA, and the NCMUE 
should prove important factors in the improvement of educational and 
psychological tests. Used with the Fourth Mental Measurements Yearbook 
(11), they should also be valuable to persons interested in selecting tests 
for use and to students seeking comprehensive knowledge of the testing 
field. Both sets of technical recommendations and the criticisms char- 
acteristic of most of the reviews of the Fourth and earlier Mental Measure- 
ments Yearbooks should make inexcusable future publication of tests so 
frequently open to such criticisms as to necessitate the compilation of the 
technical recommendations. 

Especially important to improvement of achievement testing and to 
instruction, particularly on the level of higher education, are the Taxonomy 
of Educational Objectives (9), prepared by a committee of college and 
university examiners, directed and motivated by Bloom; General Educa- 
tion: Explorations in Evaluation (14), written by Dressel and Mayhew, 
but reporting the cooperative efforts of numerous college teachers; and 
Evaluation in General Education (15), by Dressel and a number of college 
examiners. The Taxonomy, now available in a preliminary edition, is a 
thoro analysis and organization of instructional objectives in the “cogni- 
tive domain” and presents numerous examples of types of exercises, both 
objective and essay, presumed to be valid in evaluating the objectives de- 
fined. General Education: Explorations in Evaluation also deals compre- 
hensively with objectives and with types of exercises, frequently ingenious 
ones, instrumental in evaluating the objectives listed. Evaluation in General 
Education, while essentially a series of chapters describing the testing
programs in general education of a number of colleges and universities, 
should similarly contribute to the production of tests and to testing prac- 
tices extending evaluation from the measurement of knowledge or infor- 
mation thru the hierarchy of intellectual skills into the less tangible out- 
comes of education—interests, appreciations, and values. There is also 
evidence of this trend on the elementary- and high-school levels. 

Contributing also to the production of better tests and improved prac-
tices in the interpretation and use of test data are developments in test
theory and in statistical methodology. The author can mention here only 
the more important sources of information. Impressive indeed is the 
December 1954 issue of the Review entitled “Statistical Methodology in
Educational Research” (3) with its scholarly introduction and chapters 
on sampling surveys, scaling, regression and correlation, discriminant 
analysis, factor analysis, variance and covariance, statistical decision 
theory, and nonparametric methods. In this, in recent volumes of Psycho- 
metrika, Educational and Psychological Measurement, the Journal of Ex- 
perimental Education, and other journals, in monographs and theses, in 
research reports of various organizations concerned with testing, and in 
papers presented at meetings of the Psychometric Society, Division of 
Measurement and Evaluation of the American Psychological Association, 
and other professional associations, and at the annual Invitational Confer- 
ence on Testing Problems are reported advances in test theory, technical 
aspects of test construction, and use and interpretation of test data ex-
tensively treated in later chapters of this issue of the Review.

Earlier issues of the Review devoted separate chapters to developments 
in testing “outside the schools.” In our distribution of the field among the 
various authors of this issue we have not made a division of this kind. It 
may be of general interest here, however, to present a brief summary of 
the kinds of research carried on by the Bureau of Naval Personnel as 
typical of testing research, whether Army, Navy, or Air Force: 


1. Improvement in selection measures, especially with reference to non- 
cognitive behavior and specialized abilities in areas of technical competence 

2. Development of achievement measures designed to test problem- 
solving ability, involving construction and validation of items representing 
an integration of knowledge acquired rather than recall of factual infor- 
mation 

3. Development of proficiency measures, especially of the performance 
type, for the purpose of assessing achievement of learning objectives and 
knowledge of content in terms of the practical applications to assigned 
duties.* 


* From a letter from Russell A. Beam, head of the Training Evaluation Section of the Bureau of Naval 
Personnel. This research is reported in the Technical Bulletins of the Bureau, some of which are classified 
as to security, and in Research Reports published under the Office of Naval Research Contracts. For a list 
of research studies conducted in the Personnel Research Branch of the Department of the Army see item 56 
in the bibliography of this chapter. A number of Air Force studies are cited elsewhere in this issue of the 
Review. 


Developments in Testing Programs


Reference has already been made to sources of information with respect 
to developments in testing programs on the college level. Durnall (16) sur- 
veyed testing programs in junior colleges, and Robinson (42) investi- 
gated the use of standardized tests in the public secondary schools of Penn- 
sylvania. Traxler (52) recently investigated the status of testing programs 
conducted in large city-school systems. From questionnaire data received 
from 40 cities of more than 250,000 population, he reported widespread 
use of such intelligence tests as the Otis Quick-Scoring Mental Ability 
Tests, the California Test of Mental Maturity, and the Kuhlmann-Ander- 
son Intelligence Tests, with an increasing trend toward the use of tests 
which provide differential measurement of aptitudes such as the Differential 
Aptitude Tests. Among the widely used achievement batteries on the ele- 
mentary-school level are the Stanford Achievement Test, the California 
Achievement Tests, the Metropolitan Achievement Tests, and the Iowa
Tests of Basic Skills. The Cooperative achievement tests are the most 
widely used on the secondary-school level, but there is increasing use being 
made of the World Book Company’s Evaluation and Adjustment Tests. 
Interest and personality inventories extensively used in city-school systems 
include the Kuder Preference Record and the Bell Adjustment Inventory. 
Less widely used than the Kuder are the Strong Vocational Interest Blank 
and the Lee-Thorpe Occupational Interest Inventory. Some use of locally 
constructed achievement tests is made in slightly more than half of these 
large cities. In three-fourths of the cities there is much use of tests in 
guidance, and in three-fifths the test results are used extensively for in- 
structional purposes. Other uses mentioned are for administration and 
supervision, presumably for the general improvement of instruction and 
guidance. Traxler (54) also listed 12 current trends in testing based 
on data collected by questionnaire from 33 research directors of large 
city-school systems. These trends include increasing unification and coor- 
dination; greater effort to select tests in relation to fundamental objectives; 
more systematic planning to secure comparable results from grade to 
grade and year to year; more use of interest and personality inventories, 
anecdotal records, projective technics, and sociometric devices; and 
greater attention to inservice education of teachers and counselors in use 
of test results. Durost (17) also emphasized the need for inservice train- 
ing of teachers in evaluation and measurement. 

Leake (29) provided a very informative description of the survey and 
instructional or diagnostic testing programs of the Denver schools. Most 
city surveys of school achievement present data on the status of pupils in 
different grades. Such surveys are less effective in evaluating growth in 
achievement than data obtained from repeated testing of the same pupils 
as they progress thru the grades. Lillie Bowman of the San Francisco 
public schools (45) conducted a survey of this kind. Howard Bowman 
(10) described how machine tabulation is used in the Los Angeles schools 
in the production of “group analysis charts” to make intelligence, achieve-
ment, and a third variable more meaningful to pupils, teachers, and par-
ents and more helpful in the identification of deviates. The third variable
may be chronological age, average mark, or interest area. 

Unusual byproducts of program testing in city schools are exemplified 
by a compilation of devices useful in the improvement of instruction re- 
sulting from study of achievement test data obtained in 1951 and 1954 
in the Los Angeles schools (31) and by the comparison of achievement in 
reading, arithmetic, and spelling of Evanston, Illinois, public-school pupils 
in 1934 and 1953, as reported by Lanton (28). Most of the differences 
listed are significant at the 5-percent level, or better, and favor the pupils 
of 1953. More studies of this kind might counter the frequently voiced 
adverse criticisms of public education. 

Traxler (53) reported that statewide testing programs are conducted 
in 26, or 54 percent, of the states. In 10 states such programs are carried 
on by the state department of education, in 11 states by a college or uni-
versity, in two states by both the state department and an educational insti-
tution, and in three states by an association or commission. In 14 of the
states the testing is restricted to the high-school level and most frequently 
to Grade XII. The trend, however, is downward in educational level. Cen- 
tral scoring services are available for more than four-fifths of the state- 
wide testing programs. Eight states account for the majority of tests used 
in such programs—Ohio, New York, Tennessee, Minnesota, Iowa, Vir- 
ginia, Illinois, and Wisconsin. Traxler (53: 91) concluded that these pro- 
grams can be improved by “greater emphasis on testing at the lower 
grade levels for purposes of instruction and guidance, more extensive par- 
ticipation . . . on the part of the smaller rural schools, somewhat more 
systematic administration of the tests throughout the state from the stand- 
point of the tests used and time of year at which the tests are given so 
that better statewide norms can be prepared, more extensive use of central 
scoring services, and particularly more help for the schools in the effective 
use of the test results.” Statewide testing programs were also discussed by 
Berdie (8) who included a detailed description of the Minnesota program. 

Research in the testing field is often a concomitant of the more routine 
activities of those engaged in conducting testing programs. For example, 
a number of such studies were reported in the Educational Records Bureau 
Bulletins following the reporting of the data collected in the Achievement 
Testing Program in Independent Schools. These and other such reports 
of research done by members of testing staffs of city, state, and college or 
university testing programs are considered elsewhere in this issue where 
relevant to the topics of later chapters. It is evident also from letters and 
materials received from various members of the AERA that a considerable 
amount of research on problems of local interest is accomplished thru 
using test data. A few examples will be given. Orleans (39) compared the 
academic aptitude of teacher education students with the aptitude of other 
entrants of the four municipal colleges of New York City. In 1953 a com-
prehensive follow-up study of students entering in 1947 was completed
at the University of Kansas, a report of which can be obtained on request. 
A number of the bulletins of the Department of Testing and Research of 
the Public Schools of Dearborn, Michigan, are research studies of local 
significance. In addition to the administration of citywide testing and 
evaluation programs, the testing or research bureaus of large city-school 
systems conduct research on local educational problems, produce tests and 
scales for local use, and engage in studies evaluating for local use nationally
standardized tests and other measuring instruments produced elsewhere. 
Activities of this kind are among the functions of the Bureau of Educa- 
tional Research of the New York City Schools (36) and of similar bureaus 
of other cities. 

Examples of efforts to make testing more meaningful to classroom 
teachers, school administrators, and parents include discussions of testing 
in yearbooks and bulletins of educational associations and in reports of 
conference proceedings (1, 12, 13, 32, 34, 46, 57) and nontechnical
brochures (44, 58). Many reports prepared for local distribution by 
directors of testing programs strive to make test results meaningful to 
classroom teachers and administrators by definitions of technical terms, 
descriptions of what is measured by the tests, explanations of derived 
scores and norms, illustrative interpretations of data pertaining to indi- 
vidual pupils, and extensive use of graphs. Reports containing one or more 
of these features were received from Baltimore, Maryland; New Bedford, 
Massachusetts; Santa Monica, California; and other cities. Lennon (30) 
analyzed the problem of making test manuals more understandable to 
teachers, and Ebel (18) discussed problems of communication between 
test specialists and test users. Ebel (19) also described procedures useful 
in reporting analyses of objective tests to the teachers who constructed 
them. Engelhart (20) emphasized local test construction as a means of 
making testing more meaningful to teachers. Kvaraceus (27) and Noll 
(37) presented papers on inservice training in measurement thru university 
extension courses and on the introductory course in educational measure- 
ment. Mention should also be made of the growing number of workshops 
on testing for teachers, a number of which have been sponsored or directed 
by the Educational Testing Service. 

Eloise Cason, the child guidance director of the public schools of Bloom- 
field, New Jersey, described in a letter to the author how teachers can be 
helped to use test results intelligently: 


1. In each elementary school I have meetings with the staff and/or new 
teachers on how to “read” a permanent record card. Records of children 
in the school are duplicated and used as a basis for discussion. It is my 
impression that the average teacher learns most effectively about the use 
of tests by focusing on their meaning for understanding a particular child. 
Test data are related to other information. It is also possible at these meet-
ings to do a bit of educating on validity, reliability, etc. Similar meetings
are held at other school levels.

2. In the junior high school the guidance staff prepares summaries of
test data to help the teacher plan for the group with which she is working. 
(There is some “ability grouping” at this level with adjusted programs.)

3. When new tests are introduced in skills or the content fields, the 
teachers concerned are asked to evaluate the suitability of the material in 
the tests in relation to their program. I believe this procedure is greatly 
appreciated, and removes some of the fear of tests. 


While use has been made of item analysis data as a means of making 
test data more meaningful in testing programs conducted in North Caro- 
lina and in the Denver, St. Paul, and Chicago public schools, the study 
reported by Fish (21) is notable for its presentation of such data relevant 
to an extensive number of language skills of fourth-, sixth-, and eighth- 
grade pupils in the Tennessee schools. In the opinion of the author, there 
should be more studies of this kind since such data can often be much 
more concrete evidence of pupil achievement than comparisons of total 
or part scores with test norms alone. 


Sources of Information on Testing 


In addition to the Fourth Mental Measurements Yearbook, bibliograph- 
ical sources on tests and their use include Swineford’s annotated lists of 
selected references (47), the February 1953 issue of the Review of Edu- 
cational Research (2), and the chapter entitled “Use of Tests in Educa- 
tional Personnel Programs” by Hastings and McQuitty (25) in the April 
1954 issue of the Review. A bibliography of 67 references on educational 
measurement was prepared by the Research Division of the National Edu- 
cation Association (35). A student can obtain additional references by 
consulting the Education Index and other guides to periodicals and books, 
Psychological Abstracts, Dissertation Abstracts, and the book review sec- 
tions of professional journals. 

A number of new texts appeared. Especially notable and of general 
interest are Psychological Testing by Anastasi (6), Measurement and 
Evaluation in Psychology and Education by Thorndike and Hagen (49), 
and Educational Measurement by Travers (51). Also notable, but of more 
specialized interest, are the new edition of Guilford’s Psychometric Methods 
(24), Meehl’s provocative book, Clinical Versus Statistical Prediction (33), 
and Remmers’ An Introduction to Opinion and Attitude Measurement (40).
Those engaged in conducting testing programs or in learning about them
should read Introduction to Testing and the Use of Test Results in Public 
Schools by Traxler and others (55). 

Other new texts include Bean’s Construction of Educational and Per- 
sonnel Tests (7), Jordan’s Measurement in Education (26), Odell’s How 
To Improve Classroom Testing (38), Thomas’ Judging Student Progress 
(48), and Torgerson and Adams’ Measurement and Evaluation for the 
Elementary School Teacher (50). Recent revisions of widely used texts 
include Greene, Jorgensen, and Gerberich’s Measurement and Evaluation 
in the Elementary School (22) and Measurement and Evaluation in the
Secondary School (23), Remmers and Gage’s Educational Measurement 
and Evaluation (41), and Ross’s Measurement in Today’s Schools (43), 
revised by Stanley. 

In concluding this section, it seems not inappropriate to mention the 
Test Service Bulletins of the World Book Company and of the Psycho- 
logical Corporation, the College Board Review, ETS Developments, and 
Annual Reports of the Educational Testing Service, and similar publica- 
tions as sources concerning new developments in test production and test 
use that provide much information useful to the test administrator in 
obtaining tests and in using them more intelligently. 


CHAPTER II


Development and Applications of Tests 
of Special Aptitudes 


ROBERT L. THORNDIKE 


Defining the range of coverage of this chapter presented certain prob-
lems. One of these was that of defining what should be considered a special
aptitude test. In general, material on scholastic aptitude tests is considered 
in this chapter only when the study is concerned with a number of 
separate scores. Factor analysis studies have not been included since 
these were reviewed in a recent issue of the Review of Educational Re-
search (68). Articles dealing with statistical methodology as it applies to
testing are considered in the last chapter, and so the only methodological 
articles considered in this chapter are those that illustrate a methodological 
point in aptitude testing with substantive data. 


New Aptitude Batteries and Tests 


Just as World War I provided the impetus for large numbers of group 
intelligence tests, so World War II stimulated the production of aptitude 
batteries. The prewar Primary Mental Abilities batteries were followed 
shortly after the war by the General Aptitude Test Battery (GATB), the 
Differential Aptitude Tests (DAT), and the Guilford-Zimmerman Aptitude
Survey. During the period covered by the present Review, several addi-
tional batteries appeared.

The California Test Bureau, which previously produced Roeder and 
Graham’s Aptitude Tests for Occupations (59), published in 1955 Segel
and Raskin’s Multiple Aptitude Tests (66). This battery contains nine
subtests, and these are combined to yield four “factor scores”: verbal 
comprehension, perceptual speed, numerical reasoning, and spatial visual- 
ization. Validity data in the manual are limited to correlations with other 
tests and with school grades. 

The World Book Company brought out, also in 1955, the Holzinger- 
Crowder Uni-Factor Tests (30). The number of subtests is again nine, but 
the four factors measured by this battery are designated as verbal, spatial,
numerical, and reasoning. Evidence on validity is again limited to corre- 
lations with other aptitude measures and with success in academic pro-
grams. The tests in this, as well as in the previously described batteries,
are in large measure replicas or minor modifications of old friends that 
have been doing service for a good many years. 

A slightly different approach is found in the Flanagan Aptitude Classifi- 
cation Tests, published in 1953 by Science Research Associates (20). This 
set of 14 subtests aspires to measure relatively distinct “job elements.”
They range rather more widely outside the cognitive field than those 
described in previous paragraphs, being in that respect somewhat like the 
GATB, and include two tests of eye-hand coordination as well as perceptual,
memory, visualizing, verbal, and numerical measures. The components to 
be measured by the tests were reported to have been selected in part from 
analysis of recurring components of jobs, as well as from factor analyses 
of test correlations. Some data on the validity of these tests for success 
in certain occupational groups were provided by Volkin (80) and are 
discussed more fully later. However, the recommendations for weighting 
and using the tests to predict occupational success go far beyond the 
evidence provided by Volkin. Specific weighting schemes are provided for 
30 occupations, based in large part on the author’s professional judgment 
and interpretation of the general background literature. The manner of 
presentation of these recommendations may well give users an impression 
of established validity which is quite inappropriate. 

The Turse Clerical Aptitudes Test (79) has, as its name suggests, some- 
what more limited objectives. It provides two scores: learning ability and 
clerical speed. Both are fairly conventional in character, and little evidence 
is offered on the validity of either for clerical jobs. Brief mention may also 
be made of Wesman and Doppelt’s Personnel Tests for Industry: Verbal 
and Numerical (83), which are tests of verbal and numerical ability 
pitched at a level for industrial use. 

Several aptitude tests for specific purposes may merit examination, 
further research, and possibly use in personnel selection. Lindgren, Gilberg, 
and Crosby (45) described a test based on reading of passages in Dutch 
and Early Middle English designed to be used as a high level test of 
scholastic aptitude. Dunnette (19) reported on the Minnesota Engineering 
Analogies Test, a verbal analogies test based in part on verbal relations 
and in part on engineering achievement, and gave some evidence of the 
instrument’s validity against grade-point average. Clark and Malone (15) 
described a test of topographical orientation designed for use with naval 
aviators. Chriswell (14) gave some data on a test of structural dexterity 
in which the examinee uses several lengths of metal pins and bars to re- 
produce three-dimensional structures shown in perspective drawings. The 
test yielded promising validities against several criteria of success in a 
machine shop course. Bruce (10) reported the development of the Sales 
Comprehension Test, the items of which differentiated sales from non- 
sales personnel and the scores on which had negligible correlations with 
measures of verbal intelligence. 


Studies of Published Aptitude Batteries 


Mapou (50) described the development of new norms for the General 
Aptitude Test Battery based on a stratified sample of the working popu-
lation. Tho some differences from the earlier norms were statistically
significant, the changes were not deemed to be large enough to have much
practical importance. Storrs (69) reported correlations between scores 
on the GATB factors and the Wechsler-Bellevue verbal, performance, and 
full-scale scores for a group of college freshmen. Jex and Sorenson (35) 
studied the GATB as a predictor of college grades. They reported correla- 
tions with first-quarter grade-point averages for 776 men and 515 women 
at the University of Utah. The highest correlations for the men were .43 
with the G (general intelligence) and V (verbal) factors, while for the 
women they were .41 with G and .40 with Q (clerical perception). 

A typical study of the relationship of GATB scores to success in a job, 
in this case the job of tabulating machine operator, was reported jointly 
by the Minnesota and U. S. Employment Services (52). The proposed 
minimum scores for this job are G: 95, N (number): 95, S (space): 85,
and Q: 100. Schenkel (62) found somewhat higher correlations, with G 
and N having the highest validities and the test of finger dexterity next 
highest. 

The most important addition to the literature on the Differential Apti- 
tude Tests during the period covered by this Review was the new manual 
released by the authors (3). This is an impressive document bringing 
together the mass of statistical material about the tests that has previously 
appeared in separate research reports, as well as providing discussion of 
the various issues that arise in interpreting the results from these tests. 
Some of the material entering into the revised manual also appeared, tho 
in a form which is rather difficult of interpretation, in an article by 
Doppelt and Wesman (18). 

Williams (85) reported correlations of DAT verbal and abstract reason- 
ing scores with measures of general intelligence and of school achievement 
for 50 Grade X girls. Tho the verbal test correlated higher with other 
measures of intelligence, the abstract reasoning test correlated higher 
with a number of the school grades. Wolking (86) compared the correla- 
tions with school grades yielded by the DAT and the Tests of Primary 
Mental Abilities. The study was based on 267 eleventh-grade pupils. Thirty 
comparisons were made where the two tests provided measures of similar 
functions. All significant differences (nine at the .01 and four at the .05 
level) favored the DAT. 

In addition to the study just reported, several others deal with the Tests 
of Primary Mental Abilities. Marquis (51) reported correlations of the 
PMA with reading ability. Hutcheon (31) found rather inconclusive results 
in studying the PMA profiles of several groups of mental defectives. McKee 
(48) investigated the usefulness of the PMA for bright children from about 
four to eight years of age. Moody (53) reported strikingly high correla-
tions of the 4-minute Verbal-Meaning Test of the PMA with both group test
results and school marks for a rural secondary-school group. Herzberg and 
Lepkin (28) found significant sex differences on four out of five factors 
for a group of high-school seniors and suggested the need of separate sex 
norms.

Schmidt and Rothney (63) studied the occupational choices of tenth-
grade boys in relation to their PMA scores. From various sources, a
determination was made of the abilities deemed called for in the 10 most 
frequently chosen occupations. The mean scores of those who chose and 
those who did not choose a given occupation were then compared. In 
only 5 out of 20 cases were the boys choosing an occupation significantly 
above average in the abilities presumably required for that occupation. 

Mosel (54) reported evidence based on relationships to amount of school- 
ing and to intelligence test score that suggests that the Tests of General 
Educational Development are more nearly measures of intellect than of 
educational achievement. 


Predicting Success in Professional Training 


During the period covered by this Review, a number of reports appeared
on the accounting testing program being carried on under the auspices of 
the Committee on Accounting Personnel of the American Institute of 
Accountants. Kane and Jacobs (37) and Kane and Traxler (38) reported 
correlations of about .50 between score on CPA examinations and earlier 
performance on the Accounting Orientation Test. Jacobs (32) reported 
correlations near .40 between this test and grades in college accounting 
courses, finding, contrary to the results of Hendrix (27), that the specially 
prepared Accounting Orientation Test yielded higher validities in most 
institutions than the American Council on Education Psychological Ex- 
amination. Traxler (76) reported correlations between the high-school 
edition of the Accounting Orientation Test and grades in bookkeeping for 
a small group, obtaining values of about .60. He (75) also found correla- 
tions of .65 to .70 between this test and the Otis Quick-Scoring Mental 
Ability Test and ACE, and similar values for other tests of scholastic aptitude. 

Layton (42) reported quite modest validities for the tests of the nation- 
wide dental school testing program, as these were applied in the Uni- 
versity of Minnesota. 

During the last three years a number of studies reported correlations 
between aptitude measures and grades in engineering school (2, 13, 16, 
41, 46, 49, 64, 67, 81). The reader who examines these studies will find 
that many different cognitive tests showed significant relationships with 
engineering school grades. However, since the tests in the different studies 
were generally not the same and since typically no attempt was made to 
correct for (or even to report) the curtailment of the groups by one or 
another type of selection procedure, it is almost impossible to make any 
critical comparison of the results from the different studies. In particular, 
it is practically impossible to judge whether the special aptitude batteries 
developed specifically for engineer selection perform this function any 
better than general tests of scholastic aptitude. 
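
The effect of curtailment on a validity coefficient can be illustrated with the classical correction for explicit selection on the predictor. The sketch below is not drawn from any of the studies cited; the function name and the figures are hypothetical, and the formula assumes that selection was made directly on the test in question and that the standard deviation in the unselected group is known.

```python
import math

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the correlation in the unselected group from the correlation
    observed in a group curtailed by direct selection on the predictor."""
    k = sd_unrestricted / sd_restricted  # ratio of predictor SDs
    return (r_restricted * k) / math.sqrt(
        1.0 - r_restricted ** 2 + (r_restricted * k) ** 2
    )

# Hypothetical figures: observed r of .30 in admitted students whose test
# scores have an SD of 6, against an SD of 10 among all applicants.
estimate = correct_for_range_restriction(0.30, sd_unrestricted=10.0, sd_restricted=6.0)
print(round(estimate, 3))
```

When the two standard deviations are equal the function returns the observed coefficient unchanged, which is why studies that report neither figure cannot be compared on a common footing.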

A report by Layton (41) described the development and validation of 
a special aptitude test for engineers, involving primarily measures of 
certain types of spatial visualizing. Webster, Winne, and Oliver (81)
attempted to predict job success, as represented by a paired-comparison 
rating of men in small groups, and found that neither Bennett’s Test of 
Productive Thinking nor the Miller Analogies Test gave significant corre- 
lations. 

Gaier (22) made a study of medical school grades and found that most 
of the communality among them could be accounted for by a single factor. 
The Moss Medical Aptitude Test showed slightly lower correlations with 
grades than did premedical grade-point averages. Glaser and Jacobs (24) 
reported correlations of a number of tests with first- and third-year medical 
school grades. Highest correlations were yielded by the Test of Reading 
Materials in the Natural Sciences of the GED series and the Premedical 
Sciences Test of the ETS Professional Aptitude Test. 

Petrie and Powell (58) studied nurses in training in England, correlat- 
ing a number of tests with criterion ratings. Highest correlations were 
found for the Minnesota Clerical Test and educational status. 

Layton (43) studied correlations of the Iowa State Veterinary Aptitude
Test and several other measures with grades in veterinary medicine at the 
University of Minnesota. For this group, the aptitude test yielded a cor- 
relation of only .25 while preveterinary grades gave a correlation of .53. 


Predicting Academic Achievement 


Studies of the relationship of general tests of scholastic aptitude or 
achievement to school grades appeared, but reference is made only to a 
few in which specialized tests were used or more analytic studies made. 

Jensen (34) reported, somewhat incompletely, the validity of undergraduate
grades, Miller Analogies Test, Iowa Mathematical Aptitude Test,
and Cooperative Reading Comprehension Test for first-year graduate 
school grades in education, English, chemistry, and psychology. Buckton 
and Doppelt (11) found that the entrance examinations at Brooklyn Col- 
lege were substantially better predictors of later professional school en- 
trance examination scores than were college grades. Roesslein (60) re- 
ported differences in the factor pattern for a set of intelligence test sub- 
tests in the case of low- and high-achieving boys. However, no evidence 
was provided that the differences were statistically significant. Traxler 
and Townsend (77) showed that the difference between the verbal and 
quantitative subscores of the Junior Scholastic Aptitude Test had moderate 
validity as a predictor of differences between achievement in English and 
mathematics. Lorge and Diamond (47) reported correlations indicating 
substantial relationships between scores on an English placement test and 
subsequent English course grades for foreign students. 


Validation of Tests Against Job Criteria 


As in any other three-year period, there have been a number of studies 
reporting correlations of tests with various criteria of job success. As has 
been true of previous studies, many of the present crop suffer from the
smallness of the sample, the questionable reliability (and perhaps rele- 
vance) of the criterion measure, and unknown amounts of curtailment 
of the samples studied due to the selection procedures that were in effect. 
These limitations are, in part at least, inevitable in practical personnel 
situations, but they make the conclusions from any single study tentative 
in the extreme. 

The fragmentary nature and inconclusive character of the separate 
research studies lend interest to Brown and Ghiselli’s continuing attempts 
to integrate the findings of many separate researches. One report (8) in
this period compared the validity of different sorts of tests for training and 
for on-the-job criteria, coming to the conclusion that there is little cor- 
respondence between validities against the two categories of criteria. 
Another report (9) started with 18 broad categories of tests and 21 broad 
groupings of jobs, assembled the available validities for each cell, and 
attempted to group the tests into still broader categories in terms of their 
pattern of validity. 

An enterprise of general interest for those concerned with aptitude test 
validities is the Validity Information Exchange. This is a feature of the 
journal, Personnel Psychology. In each issue, beginning in 1954, appear 
brief outline reports of validity studies. This source should be scanned by 
anyone interested in data on a particular job. 

Mention might be made at this point of the considerable number of 
validation studies that are carried on in the Armed Forces, and that are 
reported only in government report series of limited distribution. These 
have not been included in the present Review because of their inacces- 
sibility both to reviewer and potential user. The items are abstracted at 
least in part in Psychological Abstracts. A research worker interested in 
a job specialty that has a military counterpart would do well to get in 
touch with such agencies as the Personnel Research Branch of the Adjutant 
General’s Office, the Research Division of the Bureau of Naval Personnel, 
and the Personnel Research Laboratory of the Air Force Personnel and 
Training Research Center to determine whether relevant material is avail- 
able from these sources. 

Two studies of the validity of How Supervise? for persons in super- 
visory positions (12, 82) agreed in giving validity coefficients of about 
.25. A supervisory test was described by Jones and Smith (36), but with 
no cross validation of the selected items on a new group. 

Several studies of clerical tests for bank employees were reported by 
Seashore (65), who also discussed the difficulties of getting meaningful 
research results in even a large-scale testing program. Wilkinson (84) 
reported that the Short Employment Tests gave appreciable correlations 
with level of job but little correlation with ratings for clerical employees 
in an insurance company. Hay (26) found, for a small group of routine 
clerks, that tests of clerical perception showed relatively high validities, but
intellectual measures showed none. Kriedt and Gadel (39) found that intelligence
had a negative weight in predicting turnover of routine clerks, while
certain biographical data carried the main load of prediction. Negative 
results were reported for the Detroit Clerical Aptitudes Examination (17) 
and for typing and stenography test scores (56) as predictors of ratings 
in clerical jobs. Gadel and Kriedt (21) reported modest correlations of 
mechanical comprehension, arithmetic reasoning, and letter-digit coding 
with job satisfaction of IBM operators. For a group of draftsmen studied 
by Perrine (57), the only aptitude test showing significant validity was 
the DAT Space Relations Test. 

Among the various studies of different job specialties reported during 
the period were one of plant protection officers by Ash (1), two of taxicab
drivers by Brown and Ghiselli (6, 7), one of camp counselors by Gilbert 
(23), one of Air Force weather forecasters by Jenkins (33), one of Air 
Force pilots by Levine and Tupes (44), one of automobile salesmen by 
Tobolski and Kerr (74), and one of Air Force mechanics by Wood (87). 


Follow-up Studies of Aptitude Testing Programs 


The development of comprehensive aptitude batteries and their adminis- 
tration on a broad scale either for norming or in guidance or classification 
programs has provided the raw material to permit extensive follow-up 
studies of test results in relation to later educational and work histories. 
One such study of the Differential Aptitude Tests was included in the 1952 
edition of the manual (3). A program of testing with a research aptitude 
test battery in Grade XII of the Pittsburgh public schools gave rise to two 
research reports. Latham (40) found no correlation between rated job 
success and the congruence of the individual’s aptitude pattern to the job 
he entered. Volkin (80), in contrast, found quite substantial correlations 
between certain aptitude test scores and rate of progress in specific work 
areas. Thus, for 275 students who had gone into some type of clerical work 
the highest correlation was .435 with a test of table and scale reading. 
For 170 who had gone into retail selling, mostly as sales clerks, the highest 
correlation was .560 with a test of verbal expression, that is, correctness of 
usage. 

Thorndike (70, 71) reported results on what was designed as a pilot 
study in which men tested during World War II with the Air Forces air- 
crew test battery were followed up 10 years later. In the pilot study, records 
were sought for a sample of 1500 cases. Reports were obtained for about 
900 of these men, indicating the nature of present job and some evaluation 
of success in it. A study presently under way will extend the investigation 
to 15,000 cases. These men were tested with a battery of some 20 aptitude 
tests in 1943, and the measure of job success is being obtained 12 years 
later. 

An additional study was that of Berdie (4), in which a 10-year follow-up 
was made of University of Minnesota students tested in 1939. Those obtain- 
ing degrees in different departments were studied to see how sharply they 
were differentiated by measures of different types. The sharpest differentiation
was by interest tests, next by achievement tests, next by aptitude
measures, and least by appraisals of personality. 

It is to be hoped that the widespread use of aptitude test batteries during 
the past 15 years will eventually yield still other comprehensive follow-up 
studies. 


Studies Dealing with Aspects of Method 


A study by Hilton and others (29) explored the validity of personnel 
assessments by psychologists. In view of the large number of psychologists 
engaging in personnel audits and executive assessment, research on the 
enterprise is badly needed. This study correlated ratings based on the psy- 
chologist’s report with criterion ratings on the same traits obtained from 
the client company. Analyses showed the ratings to have some validity 
(correlation of about .25) but no specificity in that any trait rating cor- 
related about equally with any criterion rating. 

Ryans (61) studied a test of professional information used for teacher 
selection, carrying out item analyses against both an internal test criterion 
and an external job criterion. He concluded that the item-job correlations 
were all so low as to make item selection on this basis meaningless. It 
should be noted, however, that the situation in which this was true was 
one in which the total test had decidedly modest validity (correlations 
between .13 and .23). 

Thorndike and Hagen (72) explored the methodological problems of 
carrying out a national aptitude census by testing a sample of adults in 
their homes. Some substantive results were presented (25) for testing 
carried out in this way in limited sample areas. 

Tucker (78) made an empirical comparison of multiple-regression and 
unique-pattern technics for combining test results. Tho his findings gen- 
erally indicated that multiple regression procedures gave the best results, 
he also found that a very crude division of test score into four, three, or 
even two levels and the assigning of appropriate predicted scores to the 
9, 16, 32, or 64 cells produced by the combination of different
levels on the several different tests gave scores which correlated almost 
as well with the criterion as did the composite score obtained from the 
regression equation. 
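
The kind of comparison described can be sketched with simulated data. The example below is an illustration only and does not reproduce Tucker's data or procedure: the two tests, their weights, the use of three levels per test (nine cells), and all names are assumptions. Each cell is assigned the mean criterion score of the fitting sample, and both methods are then evaluated on a separate holdout sample.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(n, rng):
    """Rows of (test 1, test 2, criterion); the weights are arbitrary assumptions."""
    rows = []
    for _ in range(n):
        t1, t2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((t1, t2, 0.5 * t1 + 0.4 * t2 + rng.gauss(0, 0.8)))
    return rows

def fit_regression(rows):
    """Least-squares weights for two predictors from the normal equations."""
    t1, t2, y = zip(*rows)
    n = len(rows)
    m1, m2, my = sum(t1) / n, sum(t2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in t1)
    s22 = sum((b - m2) ** 2 for b in t2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(t1, t2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(t1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(t2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2, my - b1 * m1 - b2 * m2

def tertile_cuts(values):
    s = sorted(values)
    return s[len(s) // 3], s[2 * len(s) // 3]

def level(score, cuts):
    """0, 1, or 2 according to how many cutting points the score exceeds."""
    return sum(score > c for c in cuts)

def fit_cells(rows, cuts1, cuts2):
    """Mean criterion score for each combination of levels on the two tests."""
    sums = {}
    for t1, t2, y in rows:
        key = (level(t1, cuts1), level(t2, cuts2))
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + y, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}

rng = random.Random(0)
fitting, holdout = simulate(2000, rng), simulate(2000, rng)

b1, b2, b0 = fit_regression(fitting)
cuts1 = tertile_cuts([r[0] for r in fitting])
cuts2 = tertile_cuts([r[1] for r in fitting])
cell_means = fit_cells(fitting, cuts1, cuts2)

criterion = [r[2] for r in holdout]
regression_pred = [b0 + b1 * r[0] + b2 * r[1] for r in holdout]
cell_pred = [cell_means[(level(r[0], cuts1), level(r[1], cuts2))] for r in holdout]

r_regression = pearson(regression_pred, criterion)
r_cells = pearson(cell_pred, criterion)
print(round(r_regression, 3), round(r_cells, 3))
```

The exact figures depend on the simulated sample, but under these assumptions the nine-cell table gives up only a modest part of the correlation obtained from the regression composite.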

Mosel (55) examined the use of single-item tests as rough screening 
devices for use against job criteria. The results reported look surprisingly 
good, but the approach needs further logical and empirical exploration 
before accepting it as having value. 

Boulger (5) applied Mahalanobis’ Generalized Distance Function to the 
task of differentiating employed workers in three different occupations. 
The method appears similar to that described by Tiedeman, Bryan, and 
Rulon (73) in relation to Air Force jobs. 
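
For readers unfamiliar with the generalized distance, a minimal computation for two groups measured on two tests is sketched below. The scores and function names are hypothetical and are not taken from Boulger's study; the pooled within-group covariance matrix is assumed to be the appropriate metric, and only two groups are used for brevity.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_covariance(groups):
    """Pooled within-group covariance matrix for two variables."""
    ss = [[0.0, 0.0], [0.0, 0.0]]
    df = 0
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    ss[i][j] += d[i] * d[j]
        df += len(rows) - 1
    return [[ss[i][j] / df for j in range(2)] for i in range(2)]

def mahalanobis_d2(group_a, group_b):
    """Squared generalized distance between two group centroids."""
    s = pooled_covariance([group_a, group_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

# Hypothetical scores on two tests for workers in two occupations.
occupation_a = [(10, 8), (12, 9), (14, 13)]
occupation_b = [(16, 9), (18, 12), (20, 12)]
d2 = mahalanobis_d2(occupation_a, occupation_b)
print(d2)
```

Because the distance is expressed in units of the within-group dispersion, it takes account of the correlation between the tests, which a simple comparison of mean differences test by test does not.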
CHAPTER III


Development and Applications of Structured Tests 
of Personality 


EDWARD J. FURST and BENNO G. FRICKE 


This chapter deals with personality, adjustment, and interest inventories.
Essentially such devices employ nonintellectual verbal stimuli to elicit 
responses which can be scored with a stencil. They are nonprojective in 
the sense that users can agree completely on the individual’s score; they 
are projective in the sense that individuals can project personal meanings 
into the stimuli. It was this definition of structured tests of personality 
that guided the reviewers in locating pertinent research. Out of the more 
than 450 publications located, the reviewers selected about half. Preference 
was given to original contributions, to representative studies, and to the 
last publication of a series. These criteria served to eliminate references 
which merely supplied data on norms, reliability, intercorrelation of scales, 
and the concurrent or status validity of well-known inventories. 

During the period under review, the construction and use of personality 
inventories apparently increased in both a relative and an absolute sense. 
Inventories are playing a prominent part in many research projects; for 
example, over 80 doctoral theses depended heavily upon inventories. They 
are also becoming popular in school testing programs. Berdie (16), for 
example, reported that almost one-third of Minnesota high-school seniors 
are tested with the Strong Vocational Interest Blank (SVIB). Yet despite 
this trend, the usefulness of these devices continues to be debated (33, 55, 
56, 62, 170). 

Emphasis continued to shift from the measurement of traits, such as 
“dominance” or “introversion-extraversion,” to the measurement and pre- 
diction of behavior in such areas of social significance as delinquency 
and academic adjustment. Empirically constructed inventories were used 
much more than those which were rationally constructed. The basic dif- 
ference between these two types lies in the manner in which the items are 
selected and scored. In empirical tests, items are retained and scored on 
the basis of their power to differentiate between groups of known char- 
acteristics; in rational tests, items are selected and scored on the basis 
of categories predetermined by the test constructor. 
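
The empirical procedure can be sketched in a few lines. The example below is a hypothetical illustration, not the method of any inventory reviewed here: the responses, the threshold of .40, and the function names are assumptions. Items are retained when the proportion of "yes" responses differs sufficiently between two criterion groups, and each retained item is keyed in the direction of the difference.

```python
def endorsement_rates(responses):
    """Proportion of respondents answering 1 ("yes") to each item."""
    n = len(responses)
    return [sum(person[j] for person in responses) / n
            for j in range(len(responses[0]))]

def build_key(group_a, group_b, threshold):
    """Keep items whose endorsement rates differ by at least `threshold`;
    key +1 if group A endorses more often, -1 if group B does."""
    rates_a, rates_b = endorsement_rates(group_a), endorsement_rates(group_b)
    key = {}
    for j, (pa, pb) in enumerate(zip(rates_a, rates_b)):
        if abs(pa - pb) >= threshold:
            key[j] = 1 if pa > pb else -1
    return key

def score(person, key):
    """Count responses given in the keyed direction."""
    return sum(1 for j, direction in key.items()
               if (person[j] == 1) == (direction == 1))

# Hypothetical 0/1 responses to four items from two criterion groups.
group_a = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0], [1, 1, 0, 1]]
group_b = [[0, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 1]]

key = build_key(group_a, group_b, threshold=0.40)
print(key)
print(score([1, 0, 0, 1], key), score([0, 1, 1, 1], key))
```

In a rational inventory the key would instead be fixed in advance from the constructor's categories, and no group comparison would enter into the selection of items.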

Some inventories (9, 45, 60, 73, 85, 113, 165) which recently were 
made available use items composed of several statements. For the test 
taker the emphasis in these compound items is on the selection of the 
appropriate statement whereas in the simple item the emphasis is on the 
selection of the appropriate response category. There was considerable 
interest in the prevention, detection, and correction of conscious and unconscious
distortion of inventory responses. Such efforts are important,
but it is not yet clear whether they represent the best way to improve the 
usefulness of the inventories. Some important studies which deal with the 
examination, at one time, of several bits of information appeared. Gen- 
erally the treatment of test scores is referred to as pattern analysis and 
the treatment of item responses is referred to as configural scoring. 

An overwhelming majority of publications presuming to be validity 
studies reported significant findings. It was difficult, however, to get a 
clear picture of the validity of the inventories. Frequently investigators ran 
a whole series of significance tests involving one or more inventory scales 
and, finding that one or two out of 20 differences or relationships reached 
an acceptable level of probability, claimed some validity for the inventory. 
What they apparently failed to realize is that such a series of significance 
tests is also subject to sampling considerations, and that the relatively few 
successful instances may have arisen purely by chance. In any case, the 
need for cross validation and replication is great. This repetition on a new 
group is essential in studies where no specific hypothesis is under investiga- 
tion. For example, there have been numerous validity studies in which 
scores for the SVIB, Minnesota Multiphasic Personality Inventory 
(MMPI), and Kuder Preference Record—Vocational (KPR—V), have 
been correlated with academic achievement; usually the investigator holds 
a general null hypothesis that there is no relationship between test score 
and criterion. In most studies some “significant” relationships are found. 
The results from this first sample should not lead to conclusions but to 
specific hypotheses to be tested on a second similar, but independent, 
sample. Spilka, Hanley, and Steer (190), for example, found a highly 
significant correlation of —.45 in the first sample drop to —.01 in the 
second sample. 
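
The sampling argument can be made concrete with the binomial distribution. The sketch below assumes that the 20 tests are independent and that every null hypothesis is true; scales from a single inventory are usually intercorrelated, so the figures are a rough guide only, and the function name is hypothetical.

```python
from math import comb

def prob_at_least(k, n_tests, alpha):
    """Probability that at least k of n independent significance tests reach
    the alpha level when no real relationship exists."""
    return sum(comb(n_tests, j) * alpha ** j * (1 - alpha) ** (n_tests - j)
               for j in range(k, n_tests + 1))

p_one = prob_at_least(1, 20, 0.05)
p_two = prob_at_least(2, 20, 0.05)
print(round(p_one, 3), round(p_two, 3))  # 0.642 0.264
```

Under these assumptions, then, one "significant" result in 20 is more likely than not, and two such results would arise by chance in roughly one study out of four.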

Most of the studies dealt with concurrent or status validity; very few 
were concerned with the predictive validity of the inventory. While it is 
true that almost all inventories are able to reveal present status or mem- 
bership in a group, the significance of this information is not always 
obvious. The study by Shaffer and Kuder (182) was typical of many 
others. They compared KPR—V scores of law, medical, and business school 
graduates and found considerable differentiation. The reviewers would 
hypothesize that much better separation could be achieved by asking the 
subjects one question: What is your occupation? It would, however, be 
important to demonstrate that future lawyers or doctors can be identified 
from tests administered in high school. This has rarely been done. The 
methodology of Van Zelst and Kerr (212), who examined items rather
than test scores, was similar; they found two-thirds of their self-descriptive 
adjectives discriminated between producers and nonproducers of inventions 
and technical publications. It would be desirable to study beginning scientific
and technical personnel for predictive validity in the items.

There was less interest in the determination and discussion of reliability 
and more concern about validity and criterion measures. It appears that
research workers are increasingly recognizing that the low validity of
some scales is due to factors other than low reliability. Most inventories
have adequate reliability.

REVIEW OF EDUCATIONAL RESEARCH Vol. XXVI, No. 1

A considerable number of researches exhibited a lack of imagination. 
It appeared to the reviewers that far too many investigators built their 
studies around a tool and easily available data rather than around a 
significant and well-conceived problem or theory. This generalization 
certainly fits many of the doctoral theses which made use of inventories.
Apparently we have confirmation here for the widely accepted notion that 
man is a tool-using animal. 


General Reviews and Discussions 


A varied and imposing array of books appeared during the period. 
Most of these dealt only in part with inventories; exceptions were two 
monographs on the MMPI (99, 100). In a text, Ferguson (68) gave what 
appears to be the first reasonably comprehensive account of structured 
personality tests. Vernon (213) also made a general review of personality
tests and assessments. 

Eysenck (63, 64) produced two outstanding books in which he presented 
his views along with factorial studies emphasizing three dimensions: 
neuroticism, introversion-extraversion, and psychoticism. The intensive 
study by Kelly and Fiske (111) involved several inventories and was 
referred to by McNemar (133) as the major prediction project of our time. 
Stephenson (192) produced an original and stimulating book in which he 
treated under the rubric of “Q-Technique” a variety of technics concerned 
with intra-individual comparison of stimuli. In a critical review of 
Stephenson’s book, Cronbach and Gleser (49) stressed the limitations of 
Q-methodology. 

Meehl (138) found the time ripe to weigh the evidence on clinical vs. 
actuarial prediction. His review showed that the simple actuarial use of 
one or two test scores is generally better in predicting or revealing a 
person’s status than judgments made by persons trained and experienced 
in the assessment work they do. These findings are especially important 
because normally it has been assumed, quite plausibly, but incorrectly, 
that a person with considerable synthesized information (such as family 
background, school and work experience, interests, interview behavior, and 
even test scores) could make a better appraisal and prediction than could 
be made mechanically by a clerk using only test scores and an actuarial 
table. 

The recent Technical Recommendations for Psychological Tests and 
Diagnostic Techniques (4) should help to forestall the development and 
use of inferior inventories, and generally to raise standards of test pub- 
lication. Out of this report came emphasis upon “construct validity,” a 
concept which Cronbach and Meehl (50) attempted to explain and 
elaborate more fully. 
Two general reviews of personality inventories appeared. That by Tiedeman
and Wilson (205) for the previous three-year period is the more
comprehensive. That by Ellis (62) covered research with personality in- 
ventories between 1946 and 1951. Ellis again took a dim view of these 
devices, expressing the feeling that they are not worth the time and trouble 
altho his survey revealed that 58 percent of the studies reported significant 
group discriminations. Calvin and McConnell (33) took issue with Ellis. 
Since he had mentioned specifically only one inventory, the MMPI, they 
checked the recent MMPI literature and found that 89 percent of those 
studies had reported significant discriminations. From his review of re- 
search on structured and unstructured personality tests, Schofield (175) 
concluded that the latter have yet to demonstrate the validity which the 
former already have. 

Altho not dealing solely with inventories, certain chapters in the Annual 
Review of Psychology (193, 194, 195) treated this subject. Reviews of 
special topics also appeared. Cottle (46) made a thoro review of MMPI 
studies. Gaier and Lee (75) reviewed pattern analysis studies. Fricke (72) 
made a critical review of 22 inventories designed to predict college achieve- 
ment. Test-retest studies were the object of Windle’s survey (216). He 
concluded that changes were slight, and that generally the second scores 
indicated better adjustment. Ghiselli and Barthol (81) examined 113 studies 
and found the median validity coefficient for personality inventories in 
business and industry to be about .25. From their review, Barnett, Stewart, 
and Super (10) concluded that the Occupational Level Scale of the SVIB
was a measure of status of interests and not of drive. 

Among the theoretical discussions, Loevinger (126) proposed five 
principles of personality measurement; these led her to prefer the struc- 
tured, two-response category type of test item. Butler (29) felt that the 
common “mental test” models are not fruitful when applied to the devel- 
opment of personality inventories. He therefore argued for the use of 
formal psychological models which are more likely to lead to a feedback 
between theory and experiment. Guilford (96) convincingly reviewed rea- 
sons why scores from some inventories should not be factor analyzed. 
Finally, Kuder (117) listed important developments which he expected 
to see in personality and interest inventories during the near future. 


New Inventories 


Some two dozen new inventories, about evenly divided between those 
constructed empirically and those constructed rationally, appeared. Gen- 
erally the former attempt to reveal a lot about behavior in a few situations, 
and the latter, a little about behavior in many situations. 

More than half the empirical inventories focused on the measurement of 
nonintellective factors associated with academic achievement. The common
approach in selecting items for a scale was to contrast the responses
of high- and low-achieving students. Most inventories, unfortunately,
have used obvious study-habits questions or statements which
are easily faked. In addition, students have not been tested before they 
began their classroom experiences, and scores have rarely been correlated 
with other easily available predictors, such as high-school rank and scores 
on ability tests, to demonstrate their value. Michael and Reeder (141) 
phrased their study-habits items in a positive manner to avoid unnecessary
ambiguities. They found some status validity for a second (cross-validation)
sample of psychology students. Bond (19) and Woodman
(217) did not check their results on cross-validation samples. Schultz
and Green (178) concluded their three-year project by using an item-selection
technic devised by Gulliksen; they developed several grade predictor
scales which had considerable validity in the item-analysis sample but
a correlation of only about .12 in the cross-validation sample. Gough (88) 
gave a limited description of the construction of a scale to predict grades 
in psychology. The median validity coefficient in cross-validation samples 
was about .32. Brown and Holtzman (25, 26) and Holtzman, Brown, and 
Farquhar (106) described the construction of the Survey of Study Habits
and Attitudes (SSHA). Considerable effort was made to eliminate ambiguous,
even tho discriminating, items. It is of some interest to note that at least
one investigator (72) sought ambiguous items to enhance the projective 
features of verbal stimuli. The validity of the SSHA in cross-validation 
samples was about .41 but, since the students were not tested prior to 
their college experience, this coefficient is difficult to interpret. A correlation
of about .25 between SSHA and ACE was reported. Fricke (72,
73) described the construction of the Opinion, Attitude and Interest
Survey (OAIS) for which two scales were cross validated to measure
what he called “academic personality.” The academic personality scores of
students who were tested six weeks before the beginning of college classes
correlated about .40 with first-quarter grades and .12 with the ACE.
Malloy (136) and Neidt and Malloy (153) described the Life Experience 
Inventory for which two grade predictor scales were constructed, one 
based on the criterion of grades and the other on the variance in grades 
unpredicted by the ACE and an English test. The two were equally valid 
in predicting grades, but only the latter scale contributed significantly in 
a multiple-regression equation. 
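
The shrinkage from item-analysis to cross-validation samples noted for several of these scales can be demonstrated by simulation. The sketch below takes the extreme case in which no item has any true validity; the sample sizes, the number of items retained, and all names are assumptions, and the simulation is not a reconstruction of any study cited.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def make_sample(n_people, n_items, rng):
    """Pure-noise item scores and grades: no item has any real validity."""
    items = [[rng.gauss(0, 1) for _ in range(n_items)] for _ in range(n_people)]
    grades = [rng.gauss(0, 1) for _ in range(n_people)]
    return items, grades

def scale_scores(items, keyed):
    """Sum of each person's scores on the retained items."""
    return [sum(person[j] for j in keyed) for person in items]

rng = random.Random(1)
items_a, grades_a = make_sample(100, 100, rng)  # item-analysis sample
items_b, grades_b = make_sample(100, 100, rng)  # cross-validation sample

# Retain the 10 items correlating most highly with grades in the first sample.
item_r = [pearson([person[j] for person in items_a], grades_a) for j in range(100)]
keyed = sorted(range(100), key=lambda j: item_r[j], reverse=True)[:10]

r_item_analysis = pearson(scale_scores(items_a, keyed), grades_a)
r_cross_validation = pearson(scale_scores(items_b, keyed), grades_b)
print(round(r_item_analysis, 2), round(r_cross_validation, 2))
```

Even with items that are wholly without validity, the coefficient in the item-analysis sample is substantial, since the items were chosen for their chance relationship with grades in that sample; only the coefficient in the second sample is an honest estimate.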

Cook, Leeds, and Callis (44) presented the Minnesota Teacher Attitude 
Inventory (MTAI) which was designed to indicate whether or not a 
teacher has or will have good rapport with pupils. Several investigators 
made use of this inventory. Callis (30) found the MTAI a valid indicator
of teacher-pupil relations in Grades IV to X. The teacher’s score correlated 
especially well with ratings made by pupils but insignificantly with princi- 
pal’s ratings. Della Piana and Gage (51) reported an important validity 
study which showed that the MTAI was best for students with high at- 
fective (as opposed to cognitive) needs and orientation. Downie and 
Bell (53) found MTAI scores were related to ACE, grade achievement,
and case history information of education students. Mitzel and Aikman 
(145) found no relation between MTAI and satisfaction with student
teaching. 

Cottle, Lewis, and Penney (48) reported a pilot study of a scale to 
distinguish counselors from teachers. Cottle and Wands (47) used the 
scale to distinguish high-school counselors from high-school teachers. 
Koile (114) described the construction of the Professional Activity In- 
ventory for College Teachers, which separated counseling teachers from 
noncounseling teachers very well in a cross-validation sample. 

Gough (87) constructed the California Psychological Inventory (CPI) 
which uses many items from the MMPI and similar items. The CPI con- 
tains over 15 empirically validated scales, one of which is the intellectual 
efficiency scale. This scale (89) was constructed by contrasting the per- 
sonality test responses of students with high and low IQ’s. Scores on this 
scale correlated .47 with IQ’s. 

Feinberg (65) described the construction and usefulness of the Per- 
sonal History Questionnaire for revealing social acceptance in young high- 
school students. He found economic level to have little bearing on item 
validity. 

Thirteen new rationally constructed inventories were developed. They 
ranged over the areas of adjustment, attitudes and opinions, empathy, in- 
terests, personality in general, and values. 

The only new adjustment inventories, oddly enough, were designed 
for use with elementary-school children. Remmers and Bauernfeind (163) 
modeled the SRA Junior Inventory after the SRA Youth Inventory. Baker 
(9) devised the “Telling What I Do” Tests for use in Grades IV thru 
VI and VII thru IX. 

Attitudes and opinions have taken on a new significance, for educators 
and psychologists have begun to turn to these as indicators of personality 
dynamics. The California Fascism Scale, described by Adorno and others 
(1), helped to stimulate this interest. This so-called F Scale continued to 
be used (189, 215) altho it may give way to other instruments, such as the 
Inventory of Beliefs and Problems in Human Relations (45), developed 
in the Cooperative Study of General Education. The Inventory of Beliefs
seems to offer a promising means of exploring relations between psychologi- 
cal dimensions, such as stereotypy, and outcomes of general education. 

An intriguing technic for measuring attitudes was described by Kubany 
(116). The items embedded in an information test require a choice between 
two answers both of which are exaggerations; the subject’s choice re- 
veals his attitude. 

The Empathy Test constructed by Kerr and Speroff (113) led to a few 
studies, the results of which were conflicting. In an industrial situation, 
Van Zelst (211) reported a validity coefficient of .59 between scores on 
this and ratings of the interpersonal desirability of skilled workers. He 
suggested that this test might be useful with counselors and teachers. 
However, Siegel (186) found no differences in the scores of clinical and 
experimental psychologists; he assumed that the former have superior 
empathic ability and that a difference should be found if the test were
valid. 

In the area of interests, Long (127) described the Job Preference Survey 
which he designed to measure the interests of the unskilled and
semiskilled. Roeber and others (165) constructed the Vocational Interest
Analyses consisting of six separate tests each containing 120 paired items; 
the tests purport to measure interest in the six areas covered by the
Lee-Thorpe Occupational Interest Inventory (Lee-Thorpe OII).

Two personality inventories showed the influence of Murray’s need
system, an influence long overdue. Edwards (60) built the Edwards
Personal Preference Schedule (EPPS) around 15 needs in the Murray
system. He paired the statements for social desirability. This was done
because the correlation between a subject’s judgment of the desirability
of an item trait and the probability it will be selected (preference value) 
was found to be .87 (61). Using two samples of nurses, Navran and 
Stauffacher (151) found the EPPS not valid in measuring need self- 
descriptions. Schaffer (172) devised a questionnaire to measure 12 psy- 
chological needs and presented intercorrelations of the scales. He stated 
that the most serious question concerns the validity of the need-strength 
scales. 

Another inventory, the Gordon Personal Profile (85), makes use of 
a four-statement, forced-choice type of item and yields several factored 
trait scores. Cattell (37) reported an extensive factor analytic study 
which presumes to have distilled the essence of important personality 
factors measured by inventories. The lone instrument intended to measure 
values was devised by Shorr (185), who did not present validity data on 
his four scales. 


New Scales for Existing Inventories 


Only a small number of new scales will be mentioned here. Generally 
these scales involve scoring certain items of the parent inventory with- 
out removing the items to form a shortened inventory. 

Utilizing the KPR—V, Tiffin and Phelan (206) conducted a well- 
designed study to detect high labor turnover. They compared the re- 
sponses of high and low tenure groups, and cross validated the resulting 
scale. A noteworthy feature is that they used the answer sheets of job 
applicants for the analysis. A scale to predict turnover of teachers could 
be constructed in the same way. 

The MMPI again came in for its share of new scales. Pearson (158) 
constructed an emotional immaturity scale by contrasting the responses 
of patients labelled “emotionally immature” with normals. While the 
scale successfully cross validated on similar groups, it could not dis- 
criminate emotionally immature patients from patients in general. Pear- 
son concluded that emotional immaturity was too general a description 
and did not represent an essential “core” of meanings. Barron (11) 
constructed a scale to predict response to psychotherapy by contrasting
the responses of 17 neurotics who improved with 16 who did not improve 
with psychotherapy. He judged the scale to be essentially a measure of 
ego strength. Cross validation yielded correlations between .38 and .54. 
Mills (144) found significant differences between test and retest profiles 
and constructed a scale to reflect pattern stability, but it proved to have no 
validity in a cross-validation group. Snyder (187) was unable to derive 
a satisfactory scale to discriminate between good and poor clinical psy- 
chology students. 

Additional MMPI scales included those for Hostility (Ho) and
Pharisaic-Virtue (Pv). Cook and Medley (42) devised these to measure ability to
get along well with others by contrasting the MMPI responses of high and 
low scorers on the MTAI. They presented (unlike over 95 percent of scale
constructors) the direction of the scored responses for the two scales. 
Less than 10 percent of the items in both scales are scored “false.” It 
seems likely that an imbalance in scored responses together with a test 
taker’s response set to select a particular response category may explain 
many correlations. A high positive correlation between Ho and Pv might 
thus be expected; Cook and Medley reported a correlation of +.69 for 
200 graduate students. It is also of interest that these scales correlated 
about -.45 with MTAI scores, a large majority of the items of which are
scored in the disagree (false) direction. 

Strong and Tucker (200) described four medical specialist scales and 
the revised physician scale. Altho the specialist scales appear to have con- 
siderable status validity in separating medical specialists from physicians 
in general, for college seniors the correlation between these specialist 
scales and the revised physician scale was about .90. The reliability of the 
latter scale was .87. A great deal can be learned about scale
construction and the SVIB from this monograph.

In a further study of the SVIB, McCornack (130) cross validated 
male and female social worker scales. He found, contrary to most others, 
that multiple weighted response keys were consistently superior; how- 
ever, when scales based on 18- rather than 6-percent differences between 
criterion groups were used, the superiority of the multiple-weight key 
was very slight. 

Garman (76) selected high and low scorers on the Taylor Manifest 
Anxiety Scale (TMAS) and Winne Neuroticism Scale and contrasted 
their responses on the SVIB. The anxiety scale for the SVIB was
successfully cross validated in two samples. An atypicality-of-response scale
constructed for the SVIB also was significantly related to these two scales.

One of the most frequently studied MMPI scales is the Taylor scale 
(203) mentioned above, which is based upon 50 MMPI items five 
clinicians agreed were indicative of anxiety. Taylor showed that the scale 
discriminated between college students and psychiatric patients remark- 
ably well. Hoyt and Magoon (109) presented TMAS scores for counselees 
judged by eight counselors to have manifested high, medium, and low 
anxiety. Altho differences were in the expected directions, an item analysis
showed 20 of the 50 items were not functioning. Kendall (112)
investigated the validity of this scale by obtaining nurses’ ratings as to the
extent of anxiety in TB patients. The reliability of the rating scale used 
by the nurses was about .91. For patients scoring in the top and bottom 
27 percent on the TMAS no significant difference was found in nurse-rated 
anxiety. A difference at the 1-percent level was found when the top and 
bottom 13 percent on the TMAS were arbitrarily selected. To determine
whether this scale measured something not measured by the standard 
MMPI scales, Brackbill and Little (21) tested seven samples (normal and 
abnormal) and found a median correlation of .91 between it and the 
Pt scale. Navran (149) constructed a scale to measure dependence. Six- 
teen judges specified MMPI items indicating dependence. Factor analysis, 
internal consistency, and other procedures produced a scale of 57 items. 


Item Form, Content, and Analysis 


Altho in the past relatively few types of inventory items have been
used, some research is now being focused on the characteristics and 
differential value of various kinds of items. 

A miscellany of studies treated in one way or another the issue of 
simple, either-or judgment vs. qualified judgment. Tuckman and Lorge 
(209) asked students to respond “yes” or “no” to attitude statements 
about old people; they also asked students to estimate the percent of old 
people described by the same statements. Since scores from these two 
answer technics correlated .93, these investigators recommended the simpler 
method. Bauernfeind (12) argued, without supporting data, that dichot- 
omously scored items are less desirable than items which permit ex- 
pressions of strength of response. He described the role of the differentially 
sized answer boxes that are used in the SRA Junior Inventory. In an im- 
portant methodological study Neidt and Edmison (152) asked subjects 
to indicate whether they agreed with one, none, or both paired state- 
ments. They found these qualification responses were not related to 
academic achievement but that they made the subjects feel better about 
responding to the items. Rosen and Rosen (167) questioned as a result 
of their study whether the “undecided” category is worth using since 
a respondent’s use of this response category is difficult to interpret. 

Other studies compared the forced-choice technic with technics calling 
for an absolute judgment of single statements. The evidence was somewhat 
conflicting. Heineman (102) presented a study of the construction of 
a forced-choice anxiety scale in which MMPI items were matched for 
social desirability. The forced-choice scale appeared to be less subject 
to distortion. Perry (159) found forced-choice items superior to like- 
indifferent-dislike items for the measurement of vocational interests. Way 
(214) found the KPR—V did not correlate very highly with a modified 
inventory using the same Kuder statements but permitting a choice of 
five responses to indicate intensity of interest. Intercorrelations of scales
on KPR—V were low, and on the modified inventory they were high. 
Osburn and others (156) found, contrary to most others, that the forced- 
choice item was less useful than the single statement item. 

Among the other articles on the forced-choice technic, Brogden (24) 
presented data to show that if forced-choice items are constructed by 
pairing statements according to susceptibility to distortion, rather than 
social desirability, the resulting scales will be more valid. Denton (52) 
and Lanman and Remmers (119) discussed preference and discrimina- 
tion indexes. Ghiselli (80) ably clarified the important concepts in the 
forced-choice technic. He suggested that elaborate procedures for deter- 
mining preference value are unnecessary. 

The relative usefulness of subtle and obvious items was the concern 
of several investigators. Seeman (180, 181) administered MMPI items 
he judged to be subtle and obvious to clinical psychology students with 
the request that they indicate how each is scored. His findings were 
amazingly clear-cut; the students were unable to specify the scale on which 
the subtle items appeared or the scored direction (true or false). Similarly, 
Garry (77) found that subtle SVIB items were difficult to fake. Apparently
subtle items, because of their greater ambiguity, offer a way to reduce or 
prevent faking. Further evidence suggested also that they may prove more 
valid in other respects (78, 86). Of special note here is the fact that subtle 
MMPI items were found more sensitive to therapeutic progress (174). 

Three other studies were primarily concerned with the content of the 
items. Bridge and Morson (22) asked 38 raters experienced in voca- 
tional counseling, placement, and job analysis to classify items in the 
Lee-Thorpe Occupational Interest Inventory according to the areas pre- 
sumed to be measured. The raters were in substantial agreement with the 
test authors on only 76 percent of the items. Grace and Grace (93) 
investigated personal-centered, interpersonal-centered, and target-centered 
value statements and their significance in terms of sociometric ratings. 
The reliabilities for the value scales and sociometric ratings were very 
high, but the validities were very low. McQuitty (134) distinguished items 
having objective and subjective cues. 

Only three studies investigated interitem relationships to reveal factorial 
composition. Ferguson (67) made a factor analysis of the items in the 
MTAI and concluded that only one type of attitude is measured. Ford and 
Tyler (69) analyzed items in Terman and Miles’ M-F Test and concluded 
that psychological sexuality is not a unitary trait, but that there are at least 
two dimensions representing emotional characteristics and interests. Brog- 
den (23) reported the interitem correlations for the Allport-Vernon Study 
of Values; he interpreted 10 of the 11 first-order factors found. 

Niven (154) gave a very clear step-by-step description of the use of 
the reciprocal averages technic and a modified Guttman scale analysis 
technic to select attitude items. He found that these technics yielded total 
scale scores which correlated about .90. Grim and Hoyt (94) used the 
method of reciprocal averages in their item analyses and described the
properties of the set of scoring weights selected. Clark and Gee (39) found 
the method of unit weighting superior to the multiple weighting used in 
the SVIB; they presented some evidence to indicate that an optimum
percentage discrepancy can be established for the selection of items and that
an optimum number of items for a scale can also be determined. Levine 
(122) described a technic for selecting items to serve as a suppressor 
variable; he suggested that items could be obtained from the invalid items 
which are normally discarded after an item analysis. Test constructors 
have here a method of pulling themselves up by their bootstraps. Feldman 
(66) found that a scale based on items selected at the 5-percent level was 
as valid as a scale using 1-percent level items. He gave an example to show 
that validity shrinkage is greatest when the original criterion groups are 
small. Gordon (86) reported that different methods resulted in assigning 
the same preference value to an item; earlier he (84) found a slight change 
in the preference value of items when their position in the inventory was 
altered. Using an attitude-interest test to predict grades, Schultz (177) 
found very little relationship between item validity and the frequency 
with which college applicants select the scored response. While this may 
be comforting to personality test constructors and users, it should not be 
forgotten that the validity of this device was depressingly low. 
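
The suppressor-variable idea mentioned above in connection with Levine (122) can be illustrated with the standard formula for the multiple correlation of a criterion with two predictors. The sketch below is not taken from Levine's paper; the function name and the numerical values are hypothetical and chosen only to show that a second scale with zero validity can still raise the multiple correlation if it correlates with the first scale.

```python
from math import sqrt


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of criterion y with predictors 1 and 2.

    r_y1, r_y2: validities of the two predictors against the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_squared)


# Hypothetical values: a scale with validity .40 and a set of "invalid"
# items (validity .00) that correlates .50 with the scale.
scale_alone = 0.40
with_suppressor = multiple_r(r_y1=0.40, r_y2=0.00, r_12=0.50)

print(round(scale_alone, 3))      # 0.4
print(round(with_suppressor, 3))  # 0.462
```

Under these assumed values the invalid items remove criterion-irrelevant variance from the valid scale, which is the sense in which discarded items might be put to use.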


Distortion of Responses 


One of the major criticisms of personality inventories is that subjects 
can and do achieve scores which do not correctly describe them. (It has 
been claimed, but not demonstrated, that unstructured tests are less open 
to faking.) A number of methods have been devised to identify certain 
conscious and unconscious contaminating influences, and a few of these, 
such as the K Scale of the MMPI, can be used to correct scores. 

Taking his cue from a study in the 1930’s, Gough (91) constructed a 
dissimulation scale for the MMPI. He found that normals, some of whom 
were experts in human behavior, were unable to simulate the responses of 
diagnosed neurotics on the MMPI. For example, 14 percent of 176 neurot- 
ics said “true” to the item, “I am sure I get a raw deal from life,” but 84 
percent of 111 normals thought that neurotics would say “true.” Gough, 
in a cross validation with the 74 items showing the largest discrepancies, 
found that normals taking this special scale under standard conditions 
did not respond differently from neurotics, and that dissemblers could 
easily be identified. Apart from the usefulness of the dissimulation scale, 
this study is especially significant since it points to the dangers in per- 
sonality test construction of specifying by judgment the response charac- 
teristic of a certain group. In another study (90) Gough selected personality 
items which drew different responses when students were asked to respond 
normally and as if they were attempting to make a good impression. A high 
score indicates an attempt to make a good impression. 

Buechley and Ball (27) reported a new internal validity indicator for
the MMPI. They devised the Tr (test-retest) Scale which uses the 16 items 
in the MMPI which are exact repetitions. The number of times an individual 
contradicts himself is his score. Despite the awkwardness of scoring, the 
Tr Scale should be used at least when there is some question as to the 
validity of a profile. 
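
The scoring rule for the Tr Scale, as described above, amounts to counting contradictions between the two presentations of each repeated item. A minimal sketch follows. The item numbers are hypothetical placeholders and are not the actual MMPI repeated items; the function name is likewise an assumption made for illustration.

```python
from typing import Mapping, Sequence, Tuple


def tr_score(responses: Mapping[int, bool],
             repeated_pairs: Sequence[Tuple[int, int]]) -> int:
    """Count the repeated-item pairs that were answered inconsistently."""
    return sum(
        1
        for first, second in repeated_pairs
        if responses[first] != responses[second]
    )


# Hypothetical item numbers standing in for three of the repeated pairs.
pairs = [(1, 101), (2, 102), (3, 103)]

# True = "true" response, False = "false" response.
answers = {1: True, 101: True, 2: False, 102: True, 3: True, 103: False}

print(tr_score(answers, pairs))  # 2
```

With the full set of 16 pairs the score would range from 0 (no contradictions) to 16.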

Two investigators devised scales to measure the response set to say 
“true.” Cohn (40) did this for the MMPI by contrasting the responses of 
those who gave an abnormally large or small number of “true” responses. 
Fricke (72) presented a simpler method for his OAIS. He showed how
the response-set scale, T, could be used as a suppressor variable to improve 
the validity of the grade predictor scales. In view of the similarity of scales 
T of the OAIS and K of the MMPI, Fricke challenged the established inter- 
pretation of K as a measure of defensiveness. Sweetland and Quay (202) 
did likewise. They felt that it might be a measure of healthy adjustment, 
and, to support their view, they pointed to the generally high negative 
correlations between K and scores from the clinical scales. 

In a well-designed investigation, Mitzel, Rabinowitz, and Ostreicher 
(146) found that the response set to give the extreme response (strongly 
agree or strongly disagree) contributed very little to the validity of the 
MTAI. Since most of the “good” responses on the MTAI are scored dis- 
agree or strongly disagree, probably a measure of the response set to deny 
the applicability of the items could be used as a suppressor variable to im- 
prove the validity of the MTAI. For the California Test of Personality, 
Lindgren (124) devised an “idealization” scale which consists of the 30 
items most easily faked. The correlation of .85 between scores on this scale 
and total scores led him to conclude that the inventory measures the tend- 
ency to idealize one’s self. Schlesser (173) described the development of 
a “lie” scale for improving the validity of a personal values inventory 
for predicting academic success. He offered suggestions for revising items 
which invite distortion. 

That individuals can consciously distort their inventory responses in a 
given direction is further substantiated by the studies reported below. Cohn 
(40) found that students could easily fake scores on the F Scale. Accord- 
ing to Scodel and Mussen (179), low-scorers (nonauthoritarian) on the 
F Scale were more accurate than high-scorers in simulating the F Scale and 
MMPI item responses of their paired opposites. Longstaff and Jurgensen 
(128) found that students, when asked to fake a good score, did not raise 
their score on a scale for self-confidence; however, when asked to fake 
a high self-confidence score, they were able to do so. Sopchak (188) gave 
the MMPI to students four times with different instructions and found 
some interesting relationships. 

Still other studies showed that students were able to simulate the re- 
sponses characteristic of certain personality types and occupational groups. 
Thus, Rabinowitz (162) found that education students, asked to simulate 
the attitudinal orientations of permissive and authoritarian teachers, were 
able to alter their scores on the MTAI to a marked extent. Gage (74)
reported that predictions of KPR—V test responses of subjects on the basis 
of observation were less accurate than predictions made on the basis of 
stereotype. Durnall (58) presented a clean-cut study of faking KPR—V;
he asked personnel students to fake the interests of an accountant-auditor.
The faked profile strongly resembled the profile of Kuder’s
accountant-auditor norm group. Garry (77) presented evidence to show the marked
extent to which specific SVIB scales could be faked, and that ability to
fake was unrelated to intelligence or information about an occupation. 

Several studies indicated that a person’s score is often contaminated by 
his set to prefer a particular response category. Those of Guilford (95), 
Rosenberg (168), and Rubin-Rabson (171), who examined use of the 
“undecided” or “?” category, are typical. A response set interpretation 
of the K Scale might account for the findings of Cook and Medley (43) 
and of Matarazzo (137). 

To test the hypothesis that considerable faking occurs in industrial selec- 
tion, Herzberg (103) examined the scores of applicants for jobs and promo- 
tions, a counseling group, and a college group. He found that the industrial 
subjects got much better scores on the Guilford-Zimmerman Temperament 
Survey (GZTS), and concluded that considerable real-life faking occurs. 
In a somewhat similar study, Herzberg and Russell (104) found that those 
desiring to change their jobs obtained KPR—V profiles different from 
others in the occupation they were leaving and similar to the profile of 
persons in the occupation they hoped to enter. 

Under some conditions, at least, distortion is not inevitable. Thus Ash 
and Abramson (7) found that scores on signed and unsigned attitude 
questionnaires did not differ. 


Pattern Analysis 


A promising trend in the use of multi-scale inventories is the considera- 
tion at one time of several scores. Special methods, some statistical, some 
empirical, and some inspectional, have been used to take into account the 
magnitude and relation of scores. Statistical approaches to pattern analysis 
recently were described (3, 75), and the last chapter of this issue also 
treats this topic. 

Evidence for the value of pattern analysis came from a study by Gough 
and Pemberton (92). These investigators found no relationship between 
single scales of the MMPI and success in practice teaching, but were able 
to identify certain patterns or indexes of scores which had considerable 
validity. A similar approach by Zwetschke (219) yielded essentially the 
same results. 

Treatment of MMPI profiles interested other investigators too. Sulli- 
van and Welsh (201) presented a technic for analyzing test profiles which 
deserves more attention. The technic involves coding and ranking a profile 
and then identifying characteristic “signs” or patterns of scores. Drake 
(54) reported an informative study of coded MMPI profiles for a college
population. From counselor notes he selected three client groups: aggres- 
sive, shy, and nonresponsive. These groups had distinctive MMPI profiles. 
Hovey and Stauffacher (108), however, found that they as clinicians could 
describe student nurses more accurately by inspecting their MMPI profiles 
than could a “mechanical” method of using MMPI scores. This is a rare 
finding (138). 

Gilberstadt (82) investigated the usefulness of Hathaway and Meehl’s 
code comparison (cc) index on a male psychiatric population. He con- 
cluded the cc approach was probably more useful than factor analysis in 
identifying meaningful surface traits. 

Other investigators applied pattern analysis to the KPR—V. Holland 
and others (105) intercorrelated and classified by means of a cluster 
analysis a sample of KPR—V profiles. Callis, Engram, and McGowan (31) 
presented a similar system for coding KPR—V profiles for occupational 
groups. In his study of the KPR—V and MMPI profiles of 2887 veterans
undertaking counseling, Harmon (98) raised many practical and theoret- 
ical questions on pattern analysis. 

A research by Frandsen and Sessions (71) showed the usefulness of rank 
order correlation for treating profiles. They obtained such correlations 
between KPR—V scores, expressed interests, and academic achievement.
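
The use of rank order correlation to compare two profiles can be sketched as follows. This is an illustration of Spearman's coefficient in its simple no-ties form, not a reconstruction of Frandsen and Sessions' procedure; the scores and ratings are hypothetical.

```python
def ranks(scores):
    """Rank scores from 1 (highest) downward; assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(profile_a, profile_b):
    """Spearman rank order correlation between two untied profiles."""
    n = len(profile_a)
    d_squared = sum(
        (ra - rb) ** 2 for ra, rb in zip(ranks(profile_a), ranks(profile_b))
    )
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Hypothetical data for one student across five interest areas.
measured = [90, 75, 40, 20, 55]   # inventory percentiles
expressed = [5, 3, 2, 1, 4]       # self-rated interest, 5 = highest

print(round(spearman_rho(measured, expressed), 2))  # 0.9
```

A coefficient computed in this way for each person indicates how closely the ordering of measured interests follows the ordering of expressed interests.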


Configural Scoring 


A most significant innovation was discussed by Meehl (139), who sug- 
gested that several responses (configurations) be scored at one time. The 
assumption is that much information is lost when one item is analyzed 
without simultaneously considering information obtainable from other 
items. The simple accumulation of points by scoring singly discriminating 
items ignores interitem relationships of importance. Meehl’s theoretical 
examples show how it is possible to use two singly invalid items to generate 
substantial configural validity. Mathematical verification for Meehl’s ap- 
proach came from Horst (107) who derived an equation and provided a 
table for obtaining configural validity coefficients for responses to item 
pairs. 
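
The point that two singly invalid items can yield substantial configural validity can be shown with a small constructed example. The data below are hypothetical and are arranged so that the criterion depends only on whether the two answers agree; this is a sketch of the general idea, not Meehl's own example or Horst's equation.

```python
from statistics import mean


def phi(xs, ys):
    """Pearson (phi) correlation for two equal-length 0/1 sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical sample of 100: the four answer patterns are equally common.
item_a = [1, 1, 0, 0] * 25
item_b = [1, 0, 1, 0] * 25
criterion = [1, 0, 0, 1] * 25  # criterion is 1 only when the answers agree

# Configural score: 1 if the two responses agree, else 0.
configural = [int(a == b) for a, b in zip(item_a, item_b)]

print(phi(item_a, criterion))      # 0.0
print(phi(item_b, criterion))      # 0.0
print(phi(configural, criterion))  # 1.0
```

Each item scored singly has zero validity, yet the pattern of the two responses predicts the criterion perfectly in this artificial case.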

Unfortunately many discriminating configurations will be missed if the 
Meehl and Horst partial configural analysis method is followed. Complete 
configural scoring was described by Fricke (72). He presented a new type 
of personality test item which involves scoring simultaneously three re- 
sponses, two of which are content responses and one an intensity response. 
This configural-content-intensity item combines the desirable features of 
configural scoring, forced choice, and intensity analysis, and appears to 
be sufficiently sensitive to permit the measurement of many unmeasured 
psychological dimensions. 

An interesting and potentially useful approach is that of McQuitty 
(135), who attempted to classify individuals according to the extent to 
which they gave the same responses. He applied his method to normals
and psychiatric patients with very discouraging results. Nevertheless,
his approach deserves more attention. Beir and Ratzeburg (13), Lazarsfeld
(121), and Milholland (142) presented additional variations of configural
analysis which might be adapted for scoring inventory responses.


Reliability, Normative, and Comparability Data 


Of the many studies relevant here, the reviewers have cited only a small 
number. There was evidence both for and against the stability of interest 
scores. Trinkaus (208) found a correlation of about .58 between SVIB
scores obtained from college freshmen and retests 15 years later; the
stability coefficients reported by Strong are considerably higher. Stordahl
(196) labeled, correctly, his study of the SVIB as a study of permanence
of interest scores (not “interests,” as is too frequently done); presumably
he recognized that the stability of interest test scores may be due to factors 
other than interest. For 47 scales he obtained a median two-year retest 
reliability of about .70, and he found about 64 percent of those who 
obtained an A or a C on the first administration obtained an A or a C 
on retesting. Using three methods to determine the stability of
Lee-Thorpe OII scores, George and Kingston (79) concluded that the scores
changed considerably in less than two months. 

An important and somewhat different study is that by Bordin and 
Wilson (20) who used the KPR—V and found a strong relationship be- 
tween test-retest stability and retention (or change) of the curriculum 
choice of freshmen; the test scores changed for those who changed their 
majors. Layton (120) tested 15 graduate students in psychology and 
educational psychology from 9 to 18 times with the MMPI and found 
considerable intra- and inter-individual variations. Carman (35) tested 
and retested divorced and married subjects with the KPR—V and the 
Edwards Personal Preference Schedule; differences were greater in the 
sample of men. 

Typical normative studies of the MMPI were those of Applezweig 
(6), Black (18), and Goodstein (83). Their studies suggested the need for 
new norms, particularly for college-level groups. Cook and Hoyt (41) 
described a procedure for determining the kinds of norms necessary for 
the MTAI. Berdie, Layton, and Hagenah (17) presented the percent of 
twelfth-grade students who received the various letter grades on the SVIB.
It is surprising to find that less than 1 percent obtain A’s on some keys
(psychologist, mathematician, city school superintendent) and that over 
50 percent obtain A’s on other keys (office worker, farmer). 

Two comparability studies, conducted independently, yielded results 
which are of general significance for inventory construction and use. 
Charen (38) and Machover and Anderson (131) pulled out all the items 
on one MMPI scale and administered them separately. They also gave 
their subjects the standard MMPI booklet. Scores on the respective scales, 
Hs and Pd, agreed well with the corresponding scores obtained from the
inventory itself. If future research substantiates this stability-regardless- 
of-context, one of the presumed limitations of empirically validated tests 
will have been removed. 

It is also significant that shortened inventories may give results com- 
parable to those obtained from the full-length inventory. Thus, scores 
based on only half the KPR—V items showed a strong relationship with 
scores based on the whole inventory (34), and scores on a short au- 
thoritarian-equalitarian scale still had considerable validity (59). More- 
over, when time is limited, the rather lengthy MMPI might well be cut 
from 565 to 420 items without great loss in accuracy of scores on the 
several scales (155). 


Validity Studies


Many studies cited up to this point were concerned with validity 
but generally in a restricted way. It is our intention now to cite studies 
in which the emphasis was upon the validity of an instrument for specific 
purposes. We have divided these studies according to whether they evalu- 
ated concurrent or predictive validity. These are mutually exclusive 
categories. If the inventory scores and criterion data were collected or 
became available at about the same time, the study could not qualify 
as one of predictive validity. We have not attempted to synthesize those 
studies which evaluated the construct validity of inventories; space did 
not allow this. 

Even tho most instruments are able to demonstrate a certain degree or 
kind of validity, few studies have demonstrated the relative usefulness of 
inventory scores when other information is easily available. For example, 
few research workers have used the one-item test, or single question, and 
compared the results with results from the inventory. This is crucial in 
studies of concurrent validity, but even in studies of predictive validity 
it is important to demonstrate that inventory results do make a unique 
contribution (129, 197, 198, 204). A sophisticated paper by Meehl and 
Rosen (140) on the role of antecedent probabilities in prediction work 
contains ideas related to points discussed here. 
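
The role of antecedent probabilities (base rates) can be illustrated by a short calculation with Bayes' rule. The sketch below is only an illustration of the general principle, not a reproduction of Meehl and Rosen's tables; the parameter names and all numerical values are hypothetical.

```python
def positive_predictive_value(base_rate: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Proportion of persons flagged by a cutting score who truly
    belong to the criterion group.

    base_rate: proportion of the population in the criterion group.
    sensitivity: proportion of the criterion group that is flagged.
    specificity: proportion of all others that is correctly not flagged.
    """
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical cutting score that classifies 80 percent of each group
# correctly, applied where the criterion group is rare and where it is common.
print(round(positive_predictive_value(0.05, 0.80, 0.80), 3))  # 0.174
print(round(positive_predictive_value(0.50, 0.80, 0.80), 3))  # 0.8
```

Under these assumed figures the same cutting score that is right four times in five for a balanced population is wrong for most of the persons it flags when only 5 percent belong to the criterion group.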


Concurrent Validity 


The fact that a study did not establish the concurrent validity of an 
inventory does not mean that it made no contribution. It is helpful to 
know, for example, that the California Test of Personality was not able 
to distinguish truants from nontruants (5); and that the K-correction 
for the MMPI did not improve discrimination between high (N=17) 
and low (N=14) groups in student teaching (210). Such negative results 
may help to define the specific purposes for which an inventory is and is 
not useful. But perhaps more importantly, they can encourage research 
workers to penetrate a problem more deeply and to explore other, more 
sensitive, technics. This generalization also holds for researches such as 
that of Rosenberg and Izard (169) who found some KPR—V differences 
in samples of graduates and drop-outs from a Naval Aviation Cadet 
training program. 

Several straightforward attempts to validate personality and adjust- 
ment inventories took ratings or similar data as the criterion. There was 
evidence that the MMPI (18, 32, 143) and the SRA Youth Inventory (57) 
correlated with such criteria. The results on the Mooney Problem Check 
List were much less convincing, but these came from only one study 
(132). The finding that Guilford’s STDCR scores were more closely re- 
lated to self than to peer ratings emerged from Carroll’s study (36). 

Occasionally concurrent validity was established for purposes rather 
different from those intended for an inventory. Thus, Rinne (164) used 
the MMPI successfully to discriminate between interest groups, and 
Forer (70) used the KPR—V successfully to discriminate maladjusted from 
adjusted persons. Such successful attempts, however, were the exception. 

In other studies involving interest inventories, level of interest scores 
on the Lee-Thorpe OII showed significant relations with students’ occupa- 
tional choices (191); a multiple regression equation proved useful in 
revealing the interests of carpenters (147); and the KPR—V differentiated 
high-school vocational agriculture students from the general norms (28). 
This last study was incorrectly labeled a selection study; as in many simi- 
larly labeled studies, the scores were not used to select anybody. 

Of studies which permitted the comparison of structured and projective 
personality tests, that by Ausubel, Schiff, and Zeleny (8) and that by 
Kornreich (115) were typical; structured tests were found more valid. 


Predictive Validity 


One kind of prediction is that of curriculum choice. In a study by 
Shaw (183) the KPR—V had some validity for indicating which of four 
high-school curriculums a ninth-grader was likely to select; however, it 
had no validity for predicting grades in these curriculums. 

Another kind of prediction is that of persistence in a school or curricu- 
lum. In a study of high-school drop-outs, Roessel (166) found the MMPI 
had considerable predictive validity; only the masculinity-femininity 
scale failed to discriminate. LaBue (118) conducted a study on persistence 
of interest in teaching and found suggestive differences between various 
subgroups on the KPR—V, SVIB, MMPI, and the Bell Adjustment In- 
ventory. Two studies, however, reported negative evidence. Healy and 
Borg (101) failed to find important differences in KPR—V and Guilford- 
Martin scores for nursing freshmen who dropped out and those who com- 
pleted the first year; and Munger (148) found no relation between per- 
sistence in college and scores on the Wrenn Study-Habits Inventory and 
the Bell Adjustment Inventory. 

As to prediction of scholastic achievement, we have already mentioned 
several approaches. A study typical of many others is that by Schofield 
(176), who found that MMPI scores for freshman medical students had 
negligible value for distinguishing the top and bottom quarters of the 
class during its third year. 

A further kind of prediction is that of occupational choice. Two investi- 
gators reported predictive validity for interest inventory scores, altho 
their results suffered from a source of contamination which makes inter- 
pretation difficult: The inventory scores had been shown to the students. 
Traphagen (207) found that 30 students who had retained high-school 
teaching as their objective after counseling, had somewhat different SVIB 
profiles than students who had changed their objective; and Levine and 
Wallen (123) found that KPR—V scores obtained during high school 
had value in predicting occupational status nine years later. Navran (150), 
however, in a study of the SVIB failed to find predictive validity for the 
nursing key. 

In other studies of occupational choice, expressed interests were found 
to have considerable predictive validity. However, only one of these 
researches permitted a direct comparison of the single question vs. the 
inventory. That one was by McArthur and Stevens (129), who compared 
the expressed (interview) and measured (SVIB) vocational interest of 60 
college sophomores. Expressed interest was the better predictor of oc- 
cupation engaged in 14 years later. Strong himself (198) found the ex- 
pressed occupational choice of freshmen to be the occupation engaged 
in 19 years later by 50 percent of the students. He estimated this to be 
the equivalent of a validity coefficient of .69, which compares very 
favorably with any psychometric device. In another follow-up of his 
gifted group, Terman (204) reported on the characteristics of scientists 
and nonscientists. Of those classified in the engineering group, 58 per- 
cent had made the occupational choice “engineer” at the age of about 10 
in 1922; only 9 percent of the lawyers had chosen engineering in 1922. 
The SVIB, taken in 1940 when the subjects were about 29 years old and 
occupationally experienced, placed 88 percent of the engineers and 32 
percent of the lawyers above a standard score of 35 (B) on the engineer- 
ing scale. 

In clinical prediction, the MMPI proved superior to the Rorschach 
in predicting which outpatients would later need hospitalization (160). 


Miscellaneous Studies 


Several generally significant or interesting reports have not been 
mentioned. Among the contributions to technic, Zaccaria, Schmid, and 
Klubeck (218) described a method for developing equivalent forms of 
an inventory; and Pierce-Jones and Carter (161) reported the develop- 
ment of a photographic interest inventory which correlated moderately 
well with KPR—V. 

Guilford and others (97) carried out an extensive factor analysis of 
95 short interest subtests but did not investigate the behavioral signifi- 
cance of the factors so isolated. Paisios and Remmers (157) identified 
three factors from their analysis of the SRA Youth Inventory. Their re- 
sults may be somewhat contaminated since the intercorrelations and 
reliabilities of the eight subtests were computed on the sample used for 
item analysis and for assigning items to the subtests. 

Strong (199) presented data on the issue: Which is preferable—a 
test with higher validity and lower reliability or a test with lower validity 
and higher reliability? The reviewers do not share with Strong his high 
regard for test reliability at the expense of validity. 

Personality inventories entered into four researches relating to cogni- 
tive performance. Sherriffs and Boomer (184) demonstrated that the 
penalty for guessing in achievement testing penalizes those who are 
insecure and poorly adjusted, according to Welsh’s A scale on the MMPI. 
Hoyt and Norman (110) concluded that scholastic aptitude tests do not 
predict well the achievement of students with high MMPI profiles. 
Bendig and Sprague (14) presented rectilinear and curvilinear correla- 
tions for the Guilford-Zimmerman Temperament Survey, psychology 
grades, and grade fluctuation; few of the relationships were significant. 
Altus (2) found 43 items on the MMPI which discriminated students 
whose Q and L scores on the ACE were widely discrepant. 

In an evaluation of counseling, Berdie (15) used the SVIB and the 
MMPI as criteria against which to judge the accuracy of self-estimates 
of vocational interests and personality. He found that, in general, changes 
in the self-estimates of a counseled group did not differ significantly 
from those of a noncounseled group. 

A refreshing and promising methodological approach was used by 
Little and Shneidman (125). Eleven psychologists competent in the use 
of the MMPI examined independently the profile of one psychiatric 
patient, and then Q-sorted 150 statements about the patient. These state- 
ments were also Q-sorted by 29 psychologists and psychiatrists who based 
their sorting on nonpsychometric information in the patient’s folder. 
The sortings of the MMPI experts and the clinicians agreed amazingly 
well. While it is unfortunate that the profile of only one patient was used, 
the study was well designed and deserves to become a model for addi- 
tional validity studies. 


Suggestions for Research 


Thruout this chapter many implications for research have appeared. 
By way of summary, we have listed below those which, in our view, re- 
present some of the important possibilities. 

Research is badly needed on the relative effectiveness of different kinds 
of item form and item content. The characteristics of subtle and unfakable 
items are virtually unknown. Many important psychological dimensions 
are unmeasured or unsatisfactorily measured possibly because of a lack 
of sufficiently subtle and sensitive items. 

Factor analysis, cluster analysis, scale analysis, and other approaches 
producing unidimensional scales should be used to analyze items which 
have already demonstrated ability to discriminate between groups. 
Subscales based on such items should be of considerable diagnostic and 
prognostic value. Much more work should be done with pattern analysis, 
configural scoring, and suppressor variables. Each of these approaches can 
generate or increase validity either directly or indirectly. 

The degree to which certain response sets contaminate the scores on in- 
ventories should be investigated. If the inventory scores cannot be freed 
of the influence of response set, test users should be supplied means for 
determining the role of response set in each person’s score. 

Studies need to be done on the relative value of the single-item test and 
the inventory. If both correlate moderately with the criterion and moder- 
ately with each other, then probably both have a place in personality and 
interest appraisal. If, however, the two measures correlate highly with 
each other and the same with the criterion, the usefulness of the inven- 
tory might be questioned. Much simple but important research can be done 
in this area. 
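
The reasoning of the preceding paragraph can be made concrete with the standard formula for the squared multiple correlation of two predictors with a criterion. The sketch below is only illustrative; the function names and the correlations supplied to them are assumed values, not findings from any study cited in this chapter.

```python
def multiple_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two predictors,
    computed from the three zero-order correlations."""
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)


def gain_from_inventory(r_single, r_inventory, r_between):
    """Criterion variance accounted for by adding the inventory to the
    single-item test, beyond what the single item accounts for alone."""
    return multiple_r_squared(r_single, r_inventory, r_between) - r_single ** 2


if __name__ == "__main__":
    # Hypothetical case 1: both measures correlate .40 with the criterion
    # and only .30 with each other.
    print(gain_from_inventory(0.40, 0.40, 0.30))   # about 0.086
    # Hypothetical case 2: same validities, but the two measures
    # correlate .90 with each other.
    print(gain_from_inventory(0.40, 0.40, 0.90))   # about 0.008
```

With these assumed values the inventory adds roughly nine percentage points of predicted variance in the first case and less than one in the second, which is the sense in which its usefulness "might be questioned" when the two measures are highly correlated.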

New scales constructed for use with nonpsychiatric persons and espe- 
cially precollege students should be composed of innocuous statements or 
questions. It is probable that important dimensions can be measured 
without using highly personal items such as those found in the MMPI and 
most other inventories. 

Many highly useful empirical scales could be constructed by item 
analyzing the answer sheets of those who are now psychotic, outstanding 
scholars, prison inmates, engineers, and the like, but who took the inven- 
tory 10, 15, or more years ago. There are many old answer sheets avail- 
able for this expensive but very important research. This approach would 
avoid, for example, what is probably the most serious methodological 
error in the construction of the SVIB; namely, that the criterion cases were 
tested at the wrong time. 


CHAPTER IV 


Development and Applications of Projective Technics 


JOHN W. M. ROTHNEY and ROBERT A. HEIMANN 


Quantitatively, research on projective technics goes on at an ever 
increasing pace. Qualitatively, however, one is reminded of the Red 
Queen’s statement to Alice: “It takes all the running you can do, to keep 
in the same place.” The solution of the clinician’s dilemma in choosing 
between extremes of impressionism and objectivity, described by the 
authors in an earlier report (86), seems as far away as when it was 
described in 1951. The overwhelming mass of literature reported on pro- 
jective devices is punctuated only occasionally by a soundly designed 
validation attempt or a meaningful normative study. Without clearer 
formulation of the problems to be solved or of the hypothesis to be tested, 
there can be no significant contributions to theory or defensible changes 
in practice. 

One important shift in direction is noted in the new text by Klopfer 
and others (52) who stated that it is more productive to view the 
Rorschach as a method of observation and appraisal rather than to class 
it as a “test” of personality. From that premise they argued that research 
in the determination of its usefulness and productivity needs to be dif- 
ferent from that used in assessing the validity of a standard psycho- 
metrically designed instrument. Altho their point of view has not yet had 
much effect on the research covered in the period under review, it seems 
likely to do so in the future. Perhaps this is the beginning of the end of 
what has always seemed doubtful procedure—classifying personality 
appraisal technics as tests; insisting that their construction, validation, and 
use follow conventional psychometric patterns; and condemning them when 
they fail to do so. Hutt (46) pointed out that projective testing is a develop- 
ing technic of fundamental and permanent importance in the history of 
psychology and social science. He pointed out that current criticism of it 
in terms of its underlying theory, methodology, and use of inadequate 
criteria for validation is similar to the criticism aimed at Freudianism. 
He claimed that the projective movement remains important, as Freudian- 
ism has, despite its critics. Bellak and Brower (13) indicated that re- 
search in the area has proceeded vigorously with the trend in the direc- 
tion of tightening and re-evaluating the technics already developed rather 
than mass-producing new ones. 


New Tests and Books 


Over a dozen new tests, most of them inadequately described in terms of 
the criteria deemed necessary for psychological tests, appeared in the 
three years covered in this Review. Wilmer and Husni (99) described 
their attempts to use records of 21 simple and complex mixtures of 
verbal, mechanical, and natural sounds as projective devices with college 
students, blind children, and tubercular patients. Their recordings are 
available for research purposes. Davids and Murray (29) offered another 
new auditory projective technic, the Azzageddi Test, and with 20 subjects 
found a correlation of .55 with various criteria of adjustment. Kutash 
and Gehl (56) presented some reliability and validity data for the 200 
normals and 200 schizophrenics on whom they standardized their graph- 
omotor drawing technic. McIntyre (69) explored the use of sound mo- 
tion pictures as a type of projective technic but found that little projec- 
tion seemed to be called forth. Failure to provide any standardization data 
makes Holsopple and Miale’s sentence completion test (43) of speculative 
utility only. A projective device designed for research purposes was pro- 
posed by Alexander (1) who named his instrument the Adult-Child Inter- 
action Test. Kahn (49) described a new symbol-arrangement test in which 
the subjects are said to reveal personality dynamics and state of mental 
health. Howard (45) outlined a new ink-blot test which has been tried 
out on some 200 subjects. Mills (76) reported a new incomplete-story 
test on the family, school, and fantasy life of children. Sargent (89) 
offered tentative norms on a new insight test in which the subject tells 
what a leading character did and how he felt about a problem situation. 
Kinget (51) presented validation data on a new drawing-completion 
test, and Koch (53) described a new tree-drawing test from which clinical 
interpretations similar to those used in handwriting analyses were made. 
Kunin’s new Szondi-type projective attitude scale (55) in which a subject 
is asked to check the facial expression that “looks like he feels” was tried 
out on 313 college students. 

No new textbooks dealing with the whole area of projective technics 
were published in the period of this Review, but several book-length 
studies, manuals, and bibliographies containing research data or sugges- 
tions for future research, appeared. Probably the most significant and 
certainly the most ambitious was the attempt by McClelland and his co- 
workers (68) to explore the achievement motive thru use of thematic 
pictures and stories. Witkin and others (100) attempted to relate in- 
dividual differences in perceptual function to significant aspects of per- 
sonality and to test the hypothesis that these tend to bear some relationship 
to one another. A psychoanalytic frame of reference for Rorschach inter- 
pretations was the theme of a new work by Schafer (90). Sarason (88) 
made a detailed presentation of approximately 100 Rorschach studies with 
implications for their interpretation. Allen (3, 4) offered a new manual 
and a new scoring system with a text on interpretation of Rorschach. 
Volume III of Beck’s new work (9) appeared, but seemingly failed to 
modify his earlier stand that Rorschach interpretation is based on faith 
and that “clinical validation” is still what the clinician feels is the correct 
procedure. Despite, for example, the research finding that there is no 
substantial evidence of validity of interpretations based on color shock, 
the classical view of color and anxiety is again presented. Phillips and 
Smith (82) offered an advanced Rorschach manual. Lowenfeld (65) pub- 
lished the first full-length book on her mosaics; the first English transla- 
tion of Szondi’s own writings (94) appeared; and Vernier (96) presented 
a discussion text on her drawing technic. David (27) produced a 16- 
page bibliography on the Szondi Test. Fry (35) wrote of a new Thematic 
Apperception Test (TAT) scoring scheme. Bellak (11) presented a book 
on TAT and Children’s Apperception Test (CAT) interpretation. 


Validity and Reliability Studies 


Studies purporting to present evidence of validity of projective technics 
tend to follow one of four patterns. Some attempt to show that there are 
differences in the projections of members of diagnostically different 
groups. Others compare the projective scores with ratings or other measures 
of personality adjustment. A few study the changes in projection when the 
scoring procedures or testing conditions are changed and controlled. 
Some indicate that the results agree with a theoretical construct of per- 
sonality. Since there are too many studies to report here, the reviewers 
have described samples of each kind. Since the Rorschach continues to 
get the most attention, a larger sample of reports on that technic is pre- 
sented. 

In an attempt to determine whether the Rorschach could reveal real- 
life stress, Berger (15) tested one group of patients on admission to a 
tuberculosis treatment facility and a matched group six months after 
hospitalization. He found that initial stress of the admission group could 
be recognized in the Rorschach scores. Grant, Ives, and Ranzoni (40) 
used Rorschach scores in an attempt to classify 71 boys and 75 girls, all 
18 years old and generally described as normal, into well-adjusted and 
maladjusted groups. The outcome shook their confidence in the ability 
of Rorschach workers to analyze records of normal subjects for use in 
group research. Corsini and Uehling (24) attempted to test the validity 
of the Davidson Rorschach Adjustment Scale by using it with normal 
probationary prison guards and prison inmates who had been matched on 
the variables of sex, age, mental test score, and race. Of the 17 differences 
in signs investigated, only three were significant beyond the 10-percent 
level, and one of these was not in the direction anticipated. The authors 
computed that the use of the Rorschach as a screening device would be but 
8 percent more effective than chance. Gallagher (37), in his review of 
findings, pointed out that the consistent failure of projective technics 
to distinguish between normals and clinical groups lies in part in the 
definition of what is normal, and he called for a clear definition in future 
validation studies. 

Comparison of Rorschach performances with adjustment as measured 
by various scales continues to produce negative or contradictory results. 


February 1956 PROJECTIVE TECHNICS 





Hamlin (42) in a review of several studies along this line concluded that 
consideration of the methodology used is necessary to understand the 
findings. He recommended that not too simple and not too complex a 
judgmental unit be used. Two of the several studies summarized by Hamlin 
include those of Cummings (25) and Newton (79). The former developed 
10 scales of adjustment for each Rorschach card. He compared the 10 
combined judgment scores of 50 hospitalized veterans on this scale with 
criteria of adjustment prepared by psychiatrists and psychologists and 
found correlation coefficients ranging from .25 to .61. In his report 
this author claimed that blind interpretation of projective tests is a 
worthwhile training device, an interesting pastime, and a useful method- 
ological part of research on the bases and limits of clinical judgment. 
Newton, using similar criteria and similar subjects as Cummings, but 
employing a seven-point scale of adjustment, found coefficients ranging 
from .09 to .20. His statement that the test could not separate normals 
from the most severely disturbed patients is in sharp contrast with the 
findings of Cummings. Morris (77) administered the Rorschach to 120 
clinical psychology trainees and also applied rating scales as a criterion 
of adjustment. The Rorschach examiners were highly successful in estimat- 
ing ratings on the scale. Symonds (93) obtained seven blind interpreta- 
tions of one Rorschach protocol by seven experienced judges who justified 
the interpretations they had made. Using extensive case materials as 
criteria, he found that the blind interpretations of the test agreed with 
the criteria 33 to 78 percent of the time with the average at 65 percent. 
There were, as indicated by the variability of judgments, considerable 
differences in the interpretation of the Rorschach protocol by the judges. 
Bialick and Hamlin (16) attempted to predict the Wechsler-Bellevue in- 
telligence quotients of 25 neurotic subjects with selected W responses to 
the Rorschach. The correlation of .68 between predicted and actual 
quotients indicated that experienced Rorschach examiners can estimate 
intelligence test scores fairly well, but the small number of subjects limits 
the value of the study. Charles and Mech (22) classified 30 college 
students as well adjusted, moderately well adjusted, and maladjusted by 
using the Monroe technic for scoring Rorschach protocols. Predictions 
were then made for the extent of unacceptable responses the students 
would make under stress conditions. The conclusion was that the Rorschach 
scores were effective in predicting the behavior of the 30 students under 
stress. 

The effect of current modes of thinking upon the kinds of interpreta- 
tions a projective tester makes was discussed by Marcuse (71). Using 
one case, he showed how the climatological political atmosphere of 1953 
produced projection by a clinician in the interpretation of a protocol. 
Berger (14) showed, by comparing Rorschach trainees’ own scores with 
the frequency that similar scores were elicited from patients, that the 
authority value of an examiner’s personality can be transmitted to a 
testee. Wedemeyer (98) got atypical and meager Rorschach reports 
from 136 Navy enlisted men of average intelligence; this may be attributed 
to the fact that the examiner was female. Gibby (38) concluded that 
Rorschach variables cannot be regarded as corresponding solely with 
the subject’s personality and its functioning. His conclusion was based 
upon the differences in responses of 240 mental hospital patients who 
were queried on their Rorschach protocols by 12 examiners in an un- 
structured, free manner, and 135 college students whose Rorschach 
inquiry was conducted in a standardized manner by nine examiners. He 
concluded that the stimulus value of the examiner was important. Robin, 
Nelson, and Clark (85) showed that Rorschach content is in part a func- 
tion of current perceptual experience. They compared responses of a group 
of subjects who waited for the test in a room where anatomical and 
medical photographs were prominently displayed with a group who waited 
in a room decorated with “sexy” pictures and still another group who 
waited in a bare room. Sex responses increased almost significantly for 
the first two groups. Fiske and Baughman (34) demonstrated that scores 
of the Rorschach based on frequencies of responses in particular scoring 
categories are unsatisfactory psychological measures. Maradie (70) made 
a careful study of the effect of using the Rorschach cards in a different 
randomized, sequential order in a Latin-square design and concluded 
that the position of the cards is important. The later cards tend to produce 
more responses. The substitution of line drawings of the Rorschach blots 
for the actual blots and the elimination of color did not produce signifi- 
cantly different perceptual behavior according to Baughman (8). This 
study was well done, but only 20 cases were used. A study by Mensh and 
Matarazzo (73) of the rejection of cards of 201 subjects suggested that 
the previously accepted premise that card rejection had psychodiagnostic 
significance was not tenable. 

The TAT is the projective device which receives most attention after the 
Rorschach. Lindzey (61) presented the 10 most commonly accepted as- 
sumptions concerning the TAT and reported his examination of empirical 
evidence for each. Webb and Hilden (97) conducted two studies to de- 
termine the extent to which intellectual and verbal ability were determinants 
of word counts on projective technics. They were led to the conclusion that 
word count cannot be considered as evidence of projective functioning 
unless the verbal functioning of a subject in several other situations is 
determined. Meyer and Tolman (75) used TAT cards suggesting family 
relations with 50 outpatient veterans and followed up with 10 therapy 
interviews. There was no relationship between the presence or absence of 
parental figures in their TAT stories and similar discussion during the 
interviews. Even if parents were discussed in both sessions, there was no 
significant similarity in the attitudes expressed. Dana’s study (26) of the 
usefulness of the TAT in clinical diagnosis is superior to many of those 
described above. His subjects were 50 normal college students and 50 each 
of hospitalized neurotics and psychotics. He devised a reliable, objective 
scoring system based on counts of objects in stimulus cards and popular 
themes, the organization of stories, the introduction of items not actually 
in the cards, and the number of incongruities introduced. He found impor- 
tant discriminative data between his normal and other subjects in terms of 
popular responses, organization, and incongruities reported. Lindzey and 
Newburg (64) used 20 undergraduate psychology students in a study of 
the use of TAT to discover anxiety. The correlations between TAT scores 
and clinically derived criteria of anxiety were .40 or less. 

The CAT is based upon the premise that children project more into 
animal than human pictures. Biersdorf and Marcuse (17) tried six animal 
and six human pictures on 30 first-grade pupils and concluded that their 
findings were contrary to the belief that children respond more to animals 
than to humans in pictures. Light (59) administered five animal and five 
human pictures to 75 middle-grade children and found that they, too, 
showed better identification with humans than with animals. 

The validity of less commonly used projective technics has been sub- 
jected to some examination. Lindner (60) used the Blacky Pictures Test 
on 67 imprisoned males legally defined as sex offenders and 67 nonsex of- 
fenders who were matched on nine variables. He concluded that the Blacky 
Pictures Test is a valid indicator of psychosexual deviation in a selected 
population. Borstelmann and Klopfer (19) reviewed and evaluated criti- 
cally the pertinent research on the Szondi Test. They concluded that inter- 
pretation of this test is a tenuous process of undetermined validity. Cohen 
and Feigenbaum (23) analyzed the Szondi records of 200 male veteran 
neuropsychiatric patients and found that the scoring procedure was un- 
satisfactory. Dudek and Patterson (31), using the Szondi Test with 100 
subjects, inadequately described, found that the matching of pictures and 
corresponding descriptions was beyond the chance level. 

One of the most complete validation studies of a less frequently used 
projective technic, the Bender-Gestalt Test, was completed by Gobetz (39). 
He administered it to 108 white, male veterans classified as neurotics by 
their scores on the MMPI, and 285 white, male veteran controls in an 
attempt to determine whether it discriminated validly between such groups. 
In addition, he checked his results in a cross-validation attempt with an- 
other group of 64 neurotic and 54 control white nonveteran students. He 
successfully developed an objective scoring system with which he was able 
to discriminate between his clinical groups at a high level of confidence. 
None of the normal signs were found exclusively in normals and none of 
the abnormal signs were found entirely with the abnormal groups, but 
some overlap was found. His recommendation was that the scoring of this 
test be used as a supplement to other tests rather than as an instru- 
ment for elaborate interpretation of individual personality dynamics. 
Mehlman and Whiteman (72) presented evidence that responses to 
three pictures of the Rosenzweig Picture-Frustration Study (P-F), con- 
cerned with frustration situations, did not predict overt behavior of 
189 college psychology students in three situational frustration in- 
cidents. Levine and Galanter (58) found a slight indication that tree 
drawings of 27 hospitalized paraplegic veterans in the House-Tree-Per- 
son Test (H-T-P) were indicators of trauma in their past experiences 
but that localization of time of shock was not possible. Sloan (91), review- 
ing studies of the validation of the H-T-P Test, pointed out that they lacked 
clarity or even logical statements about the concept of validation. He 
showed that most of the inferences drawn from this test are implicit 
rather than explicit. 

Rapkin (83) found little evidence of validity of the Projective Motor 
Test when he used a modification of the blind matching technic. Six ther- 
apists attempted to match their 36 patients with the personality descrip- 
tions obtained from the test and did not succeed. Luft (66) demonstrated 
that the interaction between subject and examiner has direct bearing on 
projective performance when he noted the differences between preferences 
for homemade inkblots by two groups of 30 college freshmen, one of 
which had been interviewed in a warm, friendly manner and the other in a 
less friendly way. Stewart’s study (92) of the expression of personality in 
drawings and paintings of 78 boys and 77 girls in a high school seemed 
to convince him that stylistic or formal qualities of artistic productions 
of adolescents can be validly related to measures of personality. 

Persons who are familiar with the concepts of validity employed in 
achievement, intelligence, and aptitude testing and the methods used in 
determining and reporting their validity find, after examination of the 
validity studies in the area of projective technics, that the standards and 
procedures are, to say the least, different. Samuels (87) found that there 
are vast differences in the descriptions of persons who have taken the same 
projective test by clinicians using four different tests, and that several 
projective tests given the normal, superior adults in his study seem to 
measure very little in common despite a similarity in name of trait or 
syndrome. He concluded that the low validity of commonly used projective 
tests limits their value for the assessment of personality dynamics in 
normal adults. Bell (10) pointed out to projective testers that there was 
still need for better criteria, particularly where life histories are used, and 
for improved experimental designs. 

Altho several writers have pointed out for many years that some of the 
methods of securing evidence about reliability, particularly the test-retest 
and split-half methods, are not appropriate for use in appraisals of per- 
sonality, some authors use them. Blanton and Landsman (18) repeated 
the administration of the group Rorschach and MMPI to 126 college 
juniors after an interval of three months. They reported a coefficient of 
correlation of .68 between the two sets of Rorschach scores, using Monroe's 
scoring procedures. DuBois and Hilden (30) developed a revised list of
18 popular Rorschach items and found an internal consistency figure of 
.63, higher than any previously reported P scale. Lindzey and Herman 
(63) reported the findings of low internal consistency figures for need 
categories of achievement, aggression, sex, abasement, and nurturance on 
the TAT with a population of 148 college students. The tetrachoric coeffi-
cients between scores of 20 of these subjects after a two-month interval,
using just four cards and 17 scoring variables, ranged from .00 to .94 
with the average near .50. Rank correlation coefficients were in the .30 
to .40 range, but the standard errors were high. Reliability of projective 
technics, if the classical psychometric concept of that term is applied, is 
still doubtful. 


Normative Procedures 


During the period under review, there seemed to be increasing interest 
in obtaining some normative data, and a few sporadic attempts were made 
to secure them for groups ranging from kindergarten children to persons 
70 years of age and older. Allen’s report (2) on the administration of the 
Rorschach 12 times to his own child at three-month intervals from ages 
4-4 to 7-0, suggested to him that the results could depict developmental 
aspects of perceptual processes reflecting intellectual and emotional matur-
ation. He called for more longitudinal research to uncover such patterns
of development. Meyer and Thompson (74) used 43 girls and 43 boys of 
kindergarten level in setting up norms for the Rorschach; and Bellak and 
Bellak (12) presented some normative data for 40 first- and second-grade 
children based on a supplement to the CAT. Ames and others (5, 6) pub- 
lished two studies of Rorschach responses, one for children aged 2 to 10, 
and one for persons aged 70 to 100. Both contain some normative tables. 
The study dealing with children’s responses involved analyses of proto- 
cols of 25 boys and 25 girls at 13 different age levels in an attempt to pro- 
vide normative data within a theoretical framework of child development. 
It is unfortunate that their population, children whose median intelligence 
test scores were superior and who came from predominately upper-middle 
class homes, limits any broad implications and universal applications. 

Ives, Grant, and Ranzoni (47) presented data on Rorschach responses 
of children and adolescents aged 11 to 18. Tomkins and Miner (95) varied 
common procedures in the projective field by testing 1500 cases of nonin- 
stitutionalized children over nine years of age and utilizing their scores 
for normative data on the Tomkins-Horn Picture Arrangement Test. These 
normative data, the most plentiful for any projective technic, are classi- 
fied by education, religion, social class, rural or urban residence, geo- 
graphic location, and performance on an intelligence test. Neff and Glaser 
(78) presented norms for 100 cases who took the Rorschach during visits 
to a vocational guidance bureau. The data were categorized for normals, 
neurotics, and psychotics. Brockway, Gleser, and Ulett (20) reported per- 
centile norms for 126 men screened by a psychiatrist and described as well 
adjusted. The group consisted largely of college students and military per- 
sonnel with a small group of patients used as an abnormal control, but 
data on the representativeness of the sampling of the normals other than 
the fact that they were paid volunteers were not provided. Fry (36) used 
the Rosenzweig P-F Study and the TAT with 236 college students and 226
prisoners. He made some group comparisons but did not report norms in
detail. Normative procedures for projective technics continue to lag be-
hind those used in the simpler achievement and aptitude test development.


Applications of Projective Technics 


It is not always possible to separate studies of the validity of projective 
technics from studies of application. At times the technics are used in situ-
ations as if they were valid; the research in applied situations is, in a
sense, validation. It would seem more appropriate to wait until validation 
has been established before use is made of an instrument. Kimball (50), 
for example, used a sentence-completion technic in a study of scholastic 
underachievement of 20 adolescent boys in a private preparatory school. 
Their responses were compared with a control group of 100 boys, inade- 
quately described, from the population of the same school. It was found 
that the underachievers made a significantly higher number of negative 
statements about their fathers. The question can be raised, aside from the 
fact that the descriptions of the subjects are too limited and the design of 
the study open to question, whether this experiment is a study of the valid- 
ity of the technic for selecting underachievers or an application of a 
technic, whose validity is assumed, to a particular problem. Similar ques- 
tions can be asked about most of the application studies described below. 

Greenbaum and others (41) attempted to evaluate the usefulness of a 
modification of the TAT in the study of the difference between attitudes of 
physically handicapped children and those of a control group matched for 
age, intelligence test scores, and type of handicap. Altho the handicapped 
children did mention their difficulties, the differences between their atti- 
tudes and those of normals were insignificant. Holzberg and Hahn (44) 
adapted the Rosenzweig P-F Study in an attempt to elicit signs of extra-
punitiveness in adolescents. The differences in responses of two equated
groups from low socioeconomic classes, consisting of 17 institutionalized 
psychopaths and 20 normal high-school students, indicated that the technic 
could not discriminate between socially aggressive psychopaths and socially 
nonaggressive normals. Lebo (57) tested the hypothesis that the use of 
recommended toys would encourage children to express themselves to a 
greater extent than the use of nonrecommended toys or no toys at all. He 
found that with 20 normal children, whose 4692 verbatim statements were 
made to the same therapist in three one-hour sessions under each set of 
conditions, there were no significant differences in the extent of expression. 
McArthur and King (67) used the Vorhaus typology in their analysis of 
differences in Rorschach performances between 137 college undergraduates 
referred to a department of hygiene for help and 74 control students. They 
suggested that the Vorhaus method might be used as a screening device in 
spotting cases who need assistance. 

Three studies of attempts to use projective technics in analyses and pre-
diction of teaching success are worth noting. Johnson (48) gave to 13
teachers the group Rorschach and the Alexander modification of the TAT
in which the pictures relate to children in social and classroom situations. 
A multiple correlation coefficient of .819 between the TAT results and 
observed classroom behavior was obtained. Addition of the Rorschach 
scores did not raise the predictive efficiency. It should be noted that only 
13 subjects were used and no information was provided on how these 
teachers were selected. Ohlsen and Schultz (80) attempted to identify the 
best and poorest student teachers with the Alexander adaption of the TAT. 
Supervisors of the 98 student teachers sorted them into two groups of 49 
best and poorest teachers. Blind analyses of the theme contents of the 
TAT responses were then made; application of the χ² test revealed sig-
nificant differences on eight themes between these student groups. Page
and Travers (81) compared the adjustment scores on the Monroe checklist 
for Rorschach responses with supervisors’ descriptions of 64 student 
teachers. The relationships were somewhat higher for elementary- than 
for secondary-school teachers, but neither was high enough to suggest that 
the test scores could be used for screening or selecting student teachers. 
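
The caution about the 13 subjects in the Johnson study can be made concrete with the familiar shrinkage correction for a multiple correlation. The sketch below is illustrative only: the number of predictors in that study is not given in this review, so the values of n_predictors in the example are hypothetical, and the adjusted R-squared formula is a standard textbook correction rather than anything reported by the author.

```python
def adjusted_r_squared(multiple_r, n_subjects, n_predictors):
    """Shrinkage-corrected squared multiple correlation.

    Uses 1 - (1 - R^2) * (n - 1) / (n - k - 1), where n is the number
    of subjects and k the number of predictors.
    """
    degrees_of_freedom = n_subjects - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more subjects than predictors plus one")
    r_squared = multiple_r ** 2
    return 1 - (1 - r_squared) * (n_subjects - 1) / degrees_of_freedom


if __name__ == "__main__":
    # R = .819 and n = 13 come from the review; the predictor counts
    # below are hypothetical because the review does not report them.
    for k in (2, 4, 6):
        corrected = adjusted_r_squared(0.819, 13, k)
        print(f"predictors={k}  adjusted R-squared={corrected:.3f}")
```

The correction grows as the number of predictors approaches the number of subjects, which is why a high multiple correlation from so small a group calls for cross-validation before it is used for prediction.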

Attempts to use projective technics for various selection and classifica- 
tion purposes continue to be made. A sample of such studies follows. 
Eron (33) looked into the possibility of using the Rorschach in the selec- 
tion of applicants for medical school training. His finding that there were 
no differences between the scores of 35 medical and 35 divinity students 
led him to the conclusion that the Rorschach could not be used successfully 
as a screening device in this area. Anastasi and Foley (7) developed a 
trial list of items drawn from the Draw-a-Person Test that discriminated 
between 50 well-adjusted and passing, and 50 maladjusted and failing 
student pilots at Randolph Field. On a second sample of similar subjects 
the trial list of items failed to distinguish between such groups. On the 
basis of data obtained from Szondi scores of 100 epileptics and 100 overt 
homosexual males David and Rabinowitz (28) concluded that the test 
should not be used routinely in clinical practice. Ritter and Eron (84) 
found that the outcome variable on the TAT did not differentiate among 
normal, psychoneurotic, and schizophrenic groups, but found significant 
group differences for other scoring factors. Lindzey and Goldberg (62) 
reported that four out of six sex differences they hypothesized were sub-
stantiated at the 5-percent level in a study using the TAT. They assumed
that males would exceed females in aggression, sex, and achievement
motives, and that the females would exceed males in abasement, nurtur-
ance, and narcissism.


Conclusion 


It is interesting to note that unanimous agreement about the value of 
projective technics does not exist among the research workers who develop 
and use them. In a previous report Rothney and Heimann (86) indicated 
that there was an increasing awareness among the projectivists of the
need for better validation procedures to get away from dogmatic asser-
tions and unproved claims which were common in the early days of re-
search in this field.

The scientifically minded research worker who reads a sample of the 
studies reported above is likely to feel at least mildly disturbed. He will
want to point out that research workers with projective technics often omit 
control groups or cases, leave out important descriptions of their subjects 
and methods, do not use enough subjects, employ faulty designs, pay little 
attention to norms, continue to employ concepts that research has shown 
to be faulty, interpret the product of the “test plus users” effect as a prod-
uct of the test itself, describe criteria vaguely, present tentative reports as
if they were final, and often introduce irrelevant information into their 
studies. Of course, if one reads carefully and critically the studies of many 
of the achievement, intelligence, and aptitude tests that are widely used, 
he must come to the conclusion that the faults listed above are not peculiar 
to projective testers. 

Critics of research in projective technics must note, as some of the pro-
jectivists themselves, notably Burchard (21), Eron (32), Hamlin (42),
Hutt (46), and Kubis (54) have done, that they are undertaking a task 
presumably more difficult than that commonly undertaken by psychom- 
etrists. Projectivists do not have any common metric comparable to the 
difficulty level concept used by achievement and intelligence testers on 
which their items can be scaled. Most of their tests must be administered 
individually and the time element becomes important. It may require about 
four hours to administer, score, analyze, interpret, and report a Rorschach. 
When that time is contrasted with the unlimited number of test adminis- 
trations that might be given with a group test, it will be seen that projective 
norms are harder to establish, and one can understand why the number of 
subjects in projective research is frequently small. This is the one area of 
measurement, however, in which the process of getting data from sub- 
jects is not neglected in the haste to get the scores to a computing machine. 
Statisticians have not yet developed satisfactory technics for treatment of 
the variables and relationships which projectivists profess to abstract from 
their data altho some may claim that the newer nonparametric and intra- 
individual statistical procedures might lend themselves to this sort of re- 
search. And it is still impossible, in the very nature of the projective situa- 
tion, to untangle the test administrator from the test score. At times these 
and other difficulties seem insuperable. The time seems to be ripe for a 
thoro examination of the whole subject of projection rather than a con- 
tinuation of the small, short-time, isolated studies which are typical of the 
research literature in this area during the period under review. 







February 1956 PROJECTIVE TECHNICS 





Bibliography 


1. ALEXANDER, THERON. The Adult-Child Interaction Test: A Projective Test for
Use in Research. Monographs of the Society for Research in Child Develop-
ment, No. 55. Champaign, Ill.: Child Development Publications, 1955. 40 p.

2. ALLEN, ROBERT M. “An Analysis of Twelve Longitudinal Rorschach Records
of One Child.” Journal of Projective Techniques 19: 111-16; June 1955.

3. ALLEN, ROBERT M. Elements of Rorschach Interpretation. New York: Inter-
national Universities Press, 1954. 242 p.

4. ALLEN, ROBERT M. Introduction to the Rorschach Technique: Manual of Ad-
ministration and Scoring. New York: International Universities Press, 1953.
126 p.

5. AMES, LOUISE B., and OTHERS. Child Rorschach Responses; Developmental
Trends from Two to Ten Years. New York: Paul B. Hoeber, 1952. 310 p.

6. AMES, LOUISE B., and OTHERS. Rorschach Responses in Old Age. New York:
Harper and Brothers, 1954. 229 p.

7. ANASTASI, ANNE, and FOLEY, JOHN P., Jr. Psychiatric Selection of Flying
Personnel: V. The Human-Figure Drawing Test as an Objective Psychiatric
Screening Aid for Student Pilots. Project No. 21-37-002, Report No. 5. Ran-
dolph Field, Texas: U. S. Air Force School of Aviation Medicine, 1952. 30 p.

8. BAUGHMAN, E. EARL. “A Comparative Analysis of Rorschach Forms with
Altered Stimulus Characteristics.” Journal of Projective Techniques 18: 151-64;
June 1954.

9. BECK, SAMUEL J. Rorschach’s Test: Advances in Interpretation. New York:
Grune and Stratton, 1952. Vol. 3, 301 p.

10. BELL, JOHN E. “Projective Techniques and the Development of Personality.”
Journal of Projective Techniques 17: 391-400; December 1953.

11. BELLAK, LEOPOLD. The Thematic Apperception Test and the Children’s Appercep-
tion Test in Clinical Use. New York: Grune and Stratton, 1954. 282 p.

12. BELLAK, LEOPOLD, and BELLAK, SONYA S. The Supplement to the Children’s
Apperception Test. (C.A.T.-S) New York: C. P. S. Co. (P. O. Box 42,
Gracie Station), 1952. 8 p.

13. BELLAK, LEOPOLD, and BROWER, DANIEL. “Projective Methods.” Progress in
Neurology and Psychiatry: An Annual Review. (Edited by E. A. Spiegel.)
New York: Grune and Stratton, 1953. p. 517-26.

14. BERGER, DAVID. “Examiner Influence on the Rorschach.” Journal of Clinical
Psychology 10: 245-48; July 1954.

15. BERGER, DAVID. “The Rorschach as a Measure of Real-Life Stress.” Journal
of Consulting Psychology 17: 355-58; October 1953.

16. BIALICK, IRVING, and HAMLIN, ROY M. “The Clinician as Judge: Details of
Procedure in Judging Projective Materials.” Journal of Consulting Psychology
18: 239-42; August 1954.

17. BIERSDORF, KATHRYN R., and MARCUSE, F. L. “Responses of Children to Human
and to Animal Pictures.” Journal of Projective Techniques 17: 455-59; Decem-
ber 1953.

18. BLANTON, RICHARD, and LANDSMAN, THEODORE. “The Retest Reliability of the
Group Rorschach and Some Relationships to the MMPI.” Journal of Con-
sulting Psychology 16: 265-67; August 1952.

19. BORSTELMANN, LLOYD J., and KLOPFER, WALTER G. “The Szondi Test: A Review
and Critical Evaluation.” Psychological Bulletin 50: 112-32; March 1953.

20. BROCKWAY, ANN L.; GLESER, GOLDINE C.; and ULETT, GEORGE A. “Rorschach
Concepts of Normality.” Journal of Consulting Psychology 18: 259-65; August
1954.

21. BURCHARD, EDWARD M. L. “The Use of Projective Techniques in the Analysis
of Creativity.” Journal of Projective Techniques 16: 412-27; December 1952.

22. CHARLES, HARVEY, and MECH, EDMUND. “Performance in an Operational
‘Stress’ Situation Related to a Projective Technique.” Journal of Educational
Research 48: 509-19; March 1955.

23. COHEN, JACOB, and FEIGENBAUM, LOUIS. “The Assumption of Additivity on
the Szondi Test.” Journal of Projective Techniques 18: 11-16; March 1954.










REVIEW OF EDUCATIONAL RESEARCH Vol. XXVI, No. 1





24. CORSINI, RAYMOND J., and UEHLING, HAROLD F. “A Cross Validation of David-
son’s Rorschach Adjustment Scale.” Journal of Consulting Psychology 18:
277-79; August 1954.

25. CUMMINGS, S. THOMAS. “The Clinician as Judge: Judgments of Adjustment
from Rorschach Single-Card Performance.” Journal of Consulting Psychology
18: 243-47; August 1954.

26. DANA, RICHARD H. “Clinical Diagnosis and Objective TAT Scoring.” Journal
of Abnormal and Social Psychology 50: 19-24; January 1955.

27. DAVID, HENRY P. “A Szondi Test Bibliography, 1939-1953.” Journal of Projective
Techniques 18: 17-32; March 1954.

28. DAVID, HENRY P., and RABINOWITZ, WILLIAM. “Szondi Patterns in Epileptic
and Homosexual Males.” Journal of Consulting Psychology 16: 247-50; August
1952.

29. DAVIDS, ANTHONY, and MURRAY, HENRY A. “Preliminary Appraisal of an
Auditory Projective Technique for Studying Personality and Cognition.” Amer-
ican Journal of Orthopsychiatry 25: 543-54; July 1955.

30. DUBOIS, PHILIP H., and HILDEN, ARNOLD H. “A P Scale for the Rorschach:
A Methodological Study.” Journal of Consulting Psychology 18: 333-36; Octo-
ber 1954.

31. DUDEK, FRANK J., and PATTERSON, HARRY O. “Relationships among the
Szondi Test Items.” Journal of Consulting Psychology 16: 389-94; October 1952.

32. ERON, LEONARD D. “Some Problems in the Research Application of the The-
matic Apperception Test.” Journal of Projective Techniques 19: 125-29; June
1955.

33. ERON, LEONARD D. “Use of the Rorschach Method in Medical School Selec-
tion.” Journal of Medical Education 29: 35-39; May 1954.

34. FISKE, DONALD W., and BAUGHMAN, E. EARL. “Relationships Between Rorschach
Scoring Categories and the Total Number of Responses.” Journal of Abnormal
and Social Psychology 48: 25-32; January 1953.

35. FRY, FRANKLYN D. “Manual for Scoring the Thematic Apperception Test.”
Journal of Psychology 35: 181-95; April 1953.

36. FRY, FRANKLYN D. “A Normative Study of the Reactions Manifested by Col-
lege Students and by State Prison Inmates in Response to the Minnesota
Multiphasic Personality Inventory, the Rosenzweig Picture-Frustration Study,
and the Thematic Apperception Test.” Journal of Psychology 34: 27-30; July
1952.

37. GALLAGHER, JAMES J. “Normality and Projective Techniques.” Journal of Ab-
normal and Social Psychology 50: 259-64; March 1955.

38. GIBBY, ROBERT G. “Examiner Influence on the Rorschach Inquiry.” Journal of
Consulting Psychology 16: 449-55; December 1952.

39. GOBETZ, WALLACE. A Quantification, Standardization, and Validation of the
Bender-Gestalt Test on Normal and Neurotic Adults. Psychological Mono-
graphs, No. 356. Washington, D. C.: American Psychological Association, 1954.
28 p.

40. GRANT, MARGUERITE Q.; IVES, VIRGINIA; and RANZONI, JANE H. Reliability
and Validity of Judges’ Ratings of Adjustment on the Rorschach. Psychological
Monographs, No. 334. Washington, D. C.: American Psychological Associa-
tion, 1952. 20 p.

41. GREENBAUM, MARVIN, and OTHERS. “Evaluation of a Modification of the The-
matic Apperception Test for Use with Physically Handicapped Children.”
Journal of Clinical Psychology 9: 40-44; January 1953.

42. HAMLIN, ROY M. “The Clinician as Judge: Implications of a Series of Studies.”
Journal of Consulting Psychology 18: 233-38; August 1954.

43. HOLSOPPLE, JAMES Q., and MIALE, FLORENCE R. Sentence Completion: A Pro-
jective Method for the Study of Personality. Springfield, Ill.: Charles C. Thomas,
1954. 177 p.

44. HOLZBERG, JULES D., and HAHN, FRED. “The Picture-Frustration Technique
as a Measure of Hostility and Guilt Reactions in Adolescent Psychopaths.”
American Journal of Orthopsychiatry 22: 776-97; October 1952.

45. HOWARD, JAMES W. “The Howard Ink Blot Test; A Descriptive Manual.” Jour-
nal of Clinical Psychology 9: 209-54; July 1953.










46. HUTT, MAX L. “Toward an Understanding of Projective Testing.” Journal of
Projective Techniques 18: 197-201; June 1954.

47. IVES, VIRGINIA; GRANT, MARGUERITE Q.; and RANZONI, JANE H. “The ‘Neu-
rotic’ Rorschachs of Normal Adolescents.” Journal of Genetic Psychology 83:
31-61; September 1953.

48. JOHNSON, GRANVILLE B., Jr. “An Evaluation Instrument for the Analysis of
Teacher Effectiveness.” Journal of Experimental Education 23: 333-44; June
1955.

49. KAHN, THEODORE C. Manual for the Kahn Test of Symbol Arrangement. Revised
edition. Danville, Calif.: the Author (Parks Air Force Base), 1953. 70 p.

50. KIMBALL, BARBARA. “The Sentence-Completion Technique in a Study of Scho-
lastic Underachievement.” Journal of Consulting Psychology 16: 353-58; Octo-
ber 1952.

51. KINGET, G. MARIAN. The Drawing-Completion Test; A Projective Technique for
the Investigation of Personality. New York: Grune and Stratton, 1952. 238 p.

52. KLOPFER, BRUNO, and OTHERS. Developments in the Rorschach Technique. Yon-
kers-on-Hudson, N. Y.: World Book Co., 1954. Vol. I, 726 p.

53. KOCH, CHARLES. The Tree Test: The Tree-Drawing Test as an Aid in Psycho-
diagnosis. New York: Grune and Stratton, 1952. 87 p.

54. KUBIS, JOSEPH K. “Projective Techniques in Guiding the Child.” Education
74: 180-89; November 1953.

55. KUNIN, THEODORE. “The Construction of a New Type of Attitude Measure.”
Personnel Psychology 8: 65-77; Spring 1955.

56. KUTASH, SAMUEL B., and GEHL, RAYMOND H. The Graphomotor Projection
Technique, Clinical Use and Standardization. Springfield, Ill.: Charles C.
Thomas, 1954. 148 p.

57. LEBO, DELL. “The Expressive Value of Toys Recommended for Nondirective
Play Therapy.” Journal of Clinical Psychology 11: 144-48; April 1955.

58. LEVINE, MURRAY, and GALANTER, EUGENE. “A Note on the Tree and Trauma
Interpretation in the H-T-P.” Journal of Consulting Psychology 17: 74-75;
February 1953.

59. LIGHT, BERNARD H. “Comparative Study of a Series of TAT and CAT Cards.”
Journal of Clinical Psychology 10: 179-81; April 1954.

60. LINDNER, HAROLD. “The Blacky Pictures Test: A Study of Sexual and Non-
Sexual Offenders.” Journal of Projective Techniques 17: 79-84; March 1953.

61. LINDZEY, GARDNER. “Thematic Apperception Test: Interpretative Assumptions
and Related Empirical Evidence.” Psychological Bulletin 49: 1-25; January
1952.

62. LINDZEY, GARDNER, and GOLDBERG, MORTON. “Motivational Differences Between
Male and Female as Measured by the Thematic Apperception Test.” Journal
of Personality 22: 101-17; September 1953.

63. LINDZEY, GARDNER, and HERMAN, PETER S. “Thematic Apperception Test: A
Note on Reliability and Situational Validity.” Journal of Projective Techniques
19: 36-42; March 1955.

64. LINDZEY, GARDNER, and NEWBURG, ARTHUR S. “Thematic Apperception Test:
A Tentative Appraisal of Some Signs of Anxiety.” Journal of Consulting
Psychology 18: 389-95; December 1954.

65. LOWENFELD, MARGARET. The Lowenfeld Mosaic Test. London: Newman Neame,
1954. 360 p.

66. LUFT, JOSEPH. “Interaction and Projection.” Journal of Projective Techniques
17: 489-92; December 1953.

67. McARTHUR, CHARLES C., and KING, STANLEY. “Rorschach Configurations Asso-
ciated with College Achievement.” Journal of Educational Psychology 45: 492-
98; December 1954.

68. McCLELLAND, DAVID C., and OTHERS. The Achievement Motive. New York: Ap-
pleton-Century-Crofts, 1953. 384 p.

69. McINTYRE, CHARLES J. “Sex, Age, and Iconicity as Factors in Projective Film
Tests.” Journal of Consulting Psychology 18: 337-43; October 1954.

70. MARADIE, LOUIS J. “Productivity on the Rorschach as a Function of Order of
Presentation.” Journal of Consulting Psychology 17: 32-35; February 1953.

71. MARCUSE, F. L. “Projection—1953.” American Psychologist 10: 43; January 1955.



















72. MEHLMAN, BENJAMIN, and WHITEMAN, STEPHAN L. “The Relationship Between
Certain Pictures of the Rosenzweig Picture-Frustration Study and Correspond-
ing Behavioral Situations.” Journal of Clinical Psychology 11: 15-19; January
1955.

73. MENSH, IVAN N., and MATARAZZO, JOSEPH D. “Rorschach Card Rejection in
Psycho-Diagnosis.” Journal of Consulting Psychology 18: 271-75; August 1954.

74. MEYER, GEORGE, and THOMPSON, JACK. “The Performance of Kindergarten Chil-
dren on the Rorschach Test: A Normative Study.” Journal of Projective Tech-
niques 16: 86-111; February 1952.

75. MEYER, MORTIMER M., and TOLMAN, RUTH S. “Correspondence Between At-
titudes and Images of Parental Figures in TAT Stories and in Therapeutic
Interviews.” Journal of Consulting Psychology 19: 79-82; April 1955.

76. MILLS, EUGENE S. “The Madeleine Thomas Completion Stories Test.” Journal
of Consulting Psychology 17: 139-41; April 1953.

77. MORRIS, WOODROW W. Rorschach Estimates of Personality Attributes in the
Michigan Assessment Project. Psychological Monographs, No. 338. Washington,
D. C.: American Psychological Association, 1952. 27 p.

78. NEFF, WALTER S., and GLASER, NATHAN M. “Normative Data on the Rorschach.”
Journal of Psychology 37: 95-104; January 1954.

79. NEWTON, RICHARD L. “The Clinician as Judge: Total Rorschachs and Clinical
Case Material.” Journal of Consulting Psychology 18: 248-50; August 1954.

80. OHLSEN, MERLE, and SCHULTZ, RAYMOND E. “Projective Test Response Patterns
for Best and Poorest Student Teachers.” Educational and Psychological Meas-
urement 15: 18-27; Spring 1955.

81. PAGE, MARTHA H., and TRAVERS, ROBERT M. W. “Relationships Between
Rorschach Performances and Student Teaching.” Journal of Educational
Psychology 44: 31-40; January 1953.

82. PHILLIPS, LESLIE, and SMITH, JOSEPH G. Rorschach Interpretation: Advanced
Technique. New York: Grune and Stratton, 1953. 385 p.

83. RAPKIN, MAURICE. “The Projective Motor Test: A Validation Study.” Journal
of Projective Techniques 17: 127-43; June 1953.

84. RITTER, ANNE M., and ERON, LEONARD D. “The Use of the Thematic Appercep-
tion Test To Differentiate Normal from Abnormal Groups.” Journal of Abnormal
and Social Psychology 47: 147-58; April 1952.

85. ROSIN, ALBERT; NELSON, WILLIAM; and CLARK, MARGARET. “Rorschach Con-
tent as a Function of Perceptual Experience and Sex of the Examiner.”
Journal of Clinical Psychology 10: 188-90; April 1954.

86. ROTHNEY, JOHN W. M., and HEIMANN, ROBERT A. “Development and Applica-
tions of Projective Tests of Personality.” Review of Educational Research 23:
70-84; February 1953.

87. SAMUELS, HENRY. The Validity of Personality-Trait Ratings Based on Projective
Techniques. Psychological Monographs, No. 337. Washington, D. C.: American
Psychological Association, 1952. 21 p.

88. SARASON, SEYMOUR B. The Clinical Interaction: With Special Reference to the
Rorschach. New York: Harper and Brothers, 1954. 425 p.

89. SARGENT, HELEN D. The Insight Test: A Verbal Projective Test for Personality
Study. New York: Grune and Stratton, 1953. 276 p.

90. SCHAFER, ROY. Psychoanalytic Interpretation in Rorschach Testing: Theory and
Application. New York: Grune and Stratton, 1954. 460 p.

91. SLOAN, WILLIAM. “A Critical Review of H-T-P Validation Studies.” Journal of
Clinical Psychology 10: 143-48; April 1954.

92. STEWART, LOUIS H. “The Expression of Personality in Drawings and Paintings.”
Genetic Psychology Monographs 51: 45-103; February 1955.

93. SYMONDS, PERCIVAL M. “A Contribution to Our Knowledge of the Validity of
the Rorschach.” Journal of Projective Techniques 19: 152-62; June 1955.

94. SZONDI, LIPOT. Experimental Diagnostics of Drives. Translated by Gertrude Aull.
New York: Grune and Stratton, 1952. 272 p.

95. TOMKINS, SILVAN S., and MINER, JOHN B. “Contributions to the Standardization
of the Tomkins-Horn Picture Arrangement Test: Plate Norms.” Journal of
Psychology 39: 199-214; January 1955.

96. VERNIER, CLAIRE M. Projective Test Productions. New York: Grune and Stratton,
1952. 168 p.













97. WEBB, WILSE B., and HILDEN, ARNOLD H. “Verbal and Intellectual Ability as
Factors in Projective Test Results.” Journal of Projective Techniques 17:
102-103; March 1953.

98. WEDEMEYER, BARBARA. “Rorschach Statistics on a Group of 136 Normal Men.”
Journal of Psychology 37: 51-58; January 1954.

99. WILMER, HARRY A., and HUSNI, MAY. “The Use of Sounds in a Projective
Test.” Journal of Consulting Psychology 17: 377-83; October 1953.

100. WITKIN, HERMAN A., and OTHERS. Personality Through Perception: An Experi-
mental and Clinical Study. New York: Harper and Brothers, 1954. 571 p.
















CHAPTER V 


Development and Applications of Tests 
of Educational Achievement 


BENJAMIN S. BLOOM and I. DE V. HEYNS 


The construction of educational achievement tests must start from some
specification of the content and behaviors desired. Bloom and others
(8) formulated a classification scheme for objectives at all levels of educa-
tion, Dressel and Mayhew (27) contributed a review and analysis of ob-
jectives of general education, and Kearney (65) reported a very thoro
analysis of elementary-school objectives. Several of the recent textbooks on 
educational testing stressed the place of objectives in selecting standardized 
tests and in constructing teacher-made achievement tests. This is especially 
true of textbooks by Greene, Jorgensen, and Gerberich (55), Jordan (64), 
Remmers and Gage (99), Ross (101), and Travers (113). 

Perhaps the major point emphasized in the various approaches to the 
statement and classification of educational objectives is the need for clari- 
fying such objectives by specifying the behaviors involved. At one time 
educational objectives were stated in such a general way that they offered 
little in the way of directions or guides to test construction. Most of the 
references found in this area show real progress in stating objectives 
clearly enough to provide specifications not only for learning experiences, 
but also for achievement test construction. 


Development of Testing Technics 


Testers have always been very fertile in creating new testing technics 
and methods for appraising student achievement. One of the most promis- 
ing new testing technics is the tab item, described by Glaser, Damrin, and 
Gardner (53), which provides an objective method for testing the problem- 
solving steps or processes used by the examinee. Friedenberg (49) de- 
scribed an objective method of evaluating the student’s ability to apply the 
methods of the social sciences to the analysis of a piece of social science 
research. Shores and Saupe (106) devised a test of reading for problem- 
solving in science in an effort to measure skills not covered by other meas- 
ures of mental ability, achievement, or general reading ability. 

The merits of essay versus objective tests of writing were considered 
by Eley (41) and Pearson (97). Huddleston (60) found that most of the 
variance on essay questions, objective questions, and paragraph-revision 
exercises may be accounted for by a verbal factor and that a typical verbal 
test is a good index of writing ability. Reynolds (100) criticized both the 
essay and the objective examination and proposed a “project” type of 
examination in which problems or projects are given the student to test
his understanding of the subject. 

Experimentation on the multiple-choice item was reported by Dressel 
and Schmid (28). They used several modified forms of the multiple-choice 
item which required the student to examine critically the choices as a 
means of providing data on the student’s certainty of knowledge and the 
quality of his thought processes. Lorge (77) criticized objective tests, 
claiming that they do not estimate the ability of an individual or a group 
to produce ideas, to innovate plans, and to evidence originality in policy 
or concept. High-level decision making, as he viewed it, requires not only 
the selection from among choices, but also the capacity for developing the 
choices to be considered. A procedure for obtaining efficient distractors 
which minimizes the differences between item difficulty in the free-answer 
and multiple-choice form was given by Frederiksen and Satter (48). 

The use of a sound motion picture for testing a complex motor skill 
was explored by the Personnel Research Branch of the Army (121). 
Lindgren (76) described the use of a projective technic, the incomplete- 
sentences test, as a means of course evaluation and of detecting changes 
in attitude toward the course. 

A more specific approach to test construction in relation to particular 
subjects or areas of competence is evident in some of the following papers. 
A thoro review of the methods and principles of language testing was pro- 
vided by Carroll (13), the development of a listening comprehension test 
was described by Dow (24), and Lado (71) criticized the objective for- 
eign language test contending that it does not adequately appraise the stu- 
dent’s knowledge of critical linguistic elements. 

Michael and Reeder (88) described their development of a study habits
inventory. Smith and Glock (107) showed that a test designed to measure 
the application of content does measure something which is necessary for 
success but is not adequately covered by aptitude or by subjectmatter 
achievement tests. Webb (123) reported validation of a science interest test. 

While a large number of new tests have been produced in the period 
covered by this Review, the writers hesitate to attempt a systematic de- 
scription of the enormous production of new instruments. However, sev- 
eral new tests and test developments must at least be mentioned. The Area 
Tests of the Graduate Record Examinations (40) represent a very ambi- 
tious attempt to cover a large range of specific subjects in a few tests and 
reflect the increasing emphasis in general education on such large areas 
as the humanities, social sciences, and natural sciences. The new SRA 
Achievement Series (110) is an outstanding effort to provide for the 
coordinated evaluation of basic skills in Grades II thru IX by means of 
three batteries, or groups of separate tests, prepared in terms of the learn- 
ing objectives of modern education. The Tests of Important Educational 
Objectives being developed by the Educational Testing Service (14) rep- 
resent another attempt to develop tests of educational progress over the 
entire range of grades and subjectmatter. 


Evidence of Change in Learners


One of the primary purposes in the use of achievement tests is to
determine the extent to which learners have been altered by a method of
teaching, a unit of instruction, or by an entire curriculum. Altho the use of
pre- and post-tests appears to be the most obvious method of appraising change,
relatively few studies are reported in the literature which involve such 
procedures. Evidently teachers, and many educational research workers, 
still hold to the assumption that students enter a unit of instruction with 
zero degree of competence or development in the area under consideration 
and that the final test is not only an index of relative standing among the 
students but is also an index of extent of growth for the period under 
consideration. Careful use of pre- and post-tests is, of course, found in the 
large studies involving cooperation among schools and employing a re- 
search and coordinating staff. A good example of such careful use of 
research methodology is found in a study by Dressel and Mayhew (27). 

Maize (84), working with retarded college freshmen, found statistically 
greater improvement in a test of English usage for an experimental group
that wrote 40 themes during the course and discussed them in class, as 
compared with another group which was drilled in grammar and spelling 
and wrote only 14 themes which were graded by the instructor and re- 
turned. In contrast, Dressel, Schmid, and Kincaid (29) found no signifi- 
cant difference between two groups when one group had 131 hours of writ- 
ing experience beyond the English course while the other group had only 
four hours of additional writing experience. The measures used were themes 
written at the beginning and end of the academic year. Bird (6) found no 
significant improvement in listening comprehension test scores for a group 
taking a communications skills course. 

In a study of word recognition skills from Grade IV thru the college 
freshman level, Triggs (117) found very little growth in the ability to 
hear and match sounds. In a study of spatial relations ability, Myers (89) 
reported some growth in a group tested before and after a course in engi- 
neering drawing and descriptive geometry. Because no control group was 
used, it is impossible to determine to what extent improvement could be 
attributed to practice or other effects. 
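
The role of a control group in such a design can be sketched as follows. This is a minimal illustration, assuming paired pre- and post-test scores for an instructed group and for a comparable uninstructed group; the function names and any scores supplied to them are hypothetical and are not drawn from the studies cited. It computes a simple difference of mean gains and is not a test of significance.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def mean_gain(pre, post):
    """Average post-minus-pre gain for paired scores (same students, same order)."""
    if len(pre) != len(post) or not pre:
        raise ValueError("pre and post must be non-empty and of equal length")
    return mean([after - before for before, after in zip(pre, post)])


def net_gain(exp_pre, exp_post, ctl_pre, ctl_post):
    """Gain of the instructed group beyond the gain of the control group.

    The control group's gain stands in for practice and other effects
    that occur without the instruction under study (an assumption of
    this sketch).
    """
    return mean_gain(exp_pre, exp_post) - mean_gain(ctl_pre, ctl_post)
```

Without the second pair of score lists, only `mean_gain` can be computed, and whatever part of it is due to practice cannot be separated from the part due to instruction.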

Lannholm (72) found significant growth in a large sample of college stu- 
dents on the Tests of General Education taken at the end of the sophomore 
and senior years of college. Greatest gain was made in the tests relating 
to the major field of study and for students who were at the top of the 
distribution at the end of the sophomore year. Bloom and Ward (7) 
found that students completing the University of Chicago general educa- 
tion program were superior on the Tests of General Education and were 
very high on each of five of the advanced tests intended only for senior 
majors in the various fields of study. They concluded that a well-integrated 
and sequential program of general education can result in both broad as 
well as deep levels of competence in the major subject fields. 

Much of the research on change in learners has emphasized measures
of information and understanding. Relatively few investigations have at- 
tempted to determine the effect of instruction on interests, attitudes, and 
adjustment. Brown (11) measured the science information and attitudes 
of pupils in Grades V and VIII and found greater differences in their 
information than in their science attitudes. Several studies were reported 
which stress human relations and interpersonal attitudes. Hayes and Conk- 
lin (58) reported that intergroup attitudes can be affected by special teach- 
ing procedures, especially by methods making use of vicarious experience. 
Kelley and Pepitone (67) discovered that a course in human relations pro- 
duced significant changes in ability to apply principles as well as in the 
development of more positive attitudes toward people. Gillies and Lastrucci 
(51) reported that a course in home and family living produced greater 
changes in information than in attitudes or personal adjustment. Tuckman 
and Lorge (118, 119) concluded that a course on the psychology of the 
adult had little effect on the attitudes of graduate students toward old 
people and the older worker. 

At a somewhat different level are studies of changes in interest and 
behavior not so clearly related to particular learning experiences. Bordin 
and Wilson (9) found changes in the Kuder Preference Record which 
reflected the changes in curriculum orientation of a group of college stu- 
dents, and Gustad (56) discovered that the social behavior and social 
preferences of a group of college students shifted after entrance to college 
with some evidence that the changes were influenced by the students’ social 
environment. 

Three miscellaneous studies of educational achievement are relevant 
here. Osborne and Sanders (96) gave the Graduate Record Examinations 
to graduate students and reported that with advancing age there is a de- 
cline in science scores, whereas social sciences, fine arts, and literature 
scores remain the same or rise. Burke and Anderson (12) compared the 
distribution of scores on the Metropolitan Achievement Tests made in 1939 
and in 1950 by elementary-school pupils. They found little difference in the 
results, but of those which were significant, more were in favor of the 
1939 group. Nelson (93) retested a group of college alumni with the 
Lentz C-R Opinionnaire 14 years after they had taken it as college students 
and found a tendency for attitudes to persist over this period. He found 
something of a trend toward more liberal attitudes altho regional differ- 
ences still persisted. 





Prediction of Academic Achievement 


With the possible exception of predicting outcomes of horse races and 
trends in the stock market, more energy and thought have been given to the 
prediction of academic achievement than to any other prediction problem 
with which the writers are acquainted. Educators, testers, and research 
workers have been obsessed with the prediction problem for the past four 
decades and appear to strain every effort to raise the correlation between
initial measures and later achievement measures beyond the usual
correlation of +.50. Not always is the prediction research related to a practical
problem of selection or diagnosis; most frequently it appears to derive 
from the availability of both evidence and statistical technics. Quite 
frequently, the initial evidence is a record of previous scholastic 
achievement. Williams and McQuary (124) found that rank in high- 
school class is the best single predictor of college success. Adams and 
Garrett (2) reported that grades in high-school physics and first-year 
college mathematics are the best predictors of grades in college physics. 
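
To make the size of such a correlation concrete, the following minimal sketch computes a product-moment correlation between an initial measure and a later achievement measure, together with the proportion of variance the one accounts for in the other. The function names are hypothetical, and the sketch does not reproduce the analysis of any study cited here.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when either list has no variance")
    return sxy / sqrt(sxx * syy)


def variance_accounted_for(r):
    """Proportion of variance in one measure associated with the other."""
    return r * r
```

A correlation of +.50 corresponds to a squared correlation of .25; that is, the initial measure accounts for about one-fourth of the variance in the later measure, which suggests why so much effort goes into raising it.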

One or more tests of scholastic aptitude or achievement are usually found 
to be good predictors of later achievement. Hoerres and O’Dea (59) 
concluded that the American Council on Education Psychological Examina- 
tion part scores are differentially predictive of various first-year course 
grades. Jenson (62) found that tests of aptitude, tests of reading or 
vocabulary, and undergraduate grades are differentially predictive of 
graduate students’ achievement in different fields of study. Coleman (19) 
discovered that tests of algebra, English, and mechanical comprehension 
are most effective in predicting freshman engineering course grades. 
Schwellenbach (104) reported that tests of arithmetic achievement and 
algebra aptitude yield very high correlations with algebra achievement 
scores for eighth-grade pupils. Clark and Johnson (16) found that the 
Test of Interpretation of Reading Materials in the Natural Sciences is the 
best of the Tests of General Educational Development for predicting 
success in preflight subjects in the U. S. Naval School. 

At a somewhat more complex level is the use of factor or cluster analysis 
in relating initial measures and measures of later achievement or per- 
formance. Newman, French, and Bobbitt (94) reported that entrance and 
aptitude tests correlate significantly with grade criteria, but do not cor- 
relate highly with criteria of adaptability to life in the Coast Guard 
Academy or to athletic proficiency and attitudes. Adcock (3) discussed 
the relationship of such factors as high-level mental ability, perceptual 
speed, impulsiveness, ego-involvement, carefulness, and immediate memory 
with criteria of scholastic success in college. 

Somewhat related are the studies of differential predictability by 
various statistical and grouping technics. Tiedeman and Bryan (111) used 
discriminant analysis in predicting choice of field of concentration from 
Kuder Preference Record scores. Abelson (1) found that high-school 
grades are differentially predictive of college grades for boys as compared 
with girls whereas aptitude test scores reveal no such differential effect. 
Frederiksen and Melville (47) used an index of compulsiveness to divide 
a group of engineering students and found that predictions of first-year 
grades from selected Strong Vocational Interest Blank scales are higher 
for the noncompulsive than for the compulsive students. 

In the past decade or so, many workers have been concerned with the 
use of interest, attitude, and personality evidence in the prediction of
achievement. Krathwohl (69) developed an index of industriousness which
he found improved the prediction of college algebra grades. Brooks and 
Weynand (10) and Frandsen and Sessions (45) reported that Kuder 
Preference Record scores yield low but positive relationships with achieve- 
ment measures. Givens (52) found that the Kuder Preference Record 
scientific scale is not related to college science grades. Rust and Ryan (102) 
developed empirical keys for the Strong Vocational Interest Blank and 
found them to be of some value in distinguishing among overachievers, 
underachievers, and normal achievers in college. 

Several workers have attempted to relate attitude scales to achieve- 
ment. Bendig and Hughes (4) found a 30-item scale of student attitude 
toward statistics of some value in predicting grades in an introductory 
statistics course. However, Schultz and Green (103) did not find an attitude- 
interest questionnaire of much value in predicting college grades for 
women. Downie and Bell (25) reported that the Minnesota Teacher 
Attitude Inventory is of value in predicting grades and in distinguishing 
between good and poor prospective teachers among education freshmen 
and sophomores, as rated by faculty. 

Gough (54) reported that a 36-item personality scale yields mean cor- 
relations of +.38 and +.36 with college and high-school grades respec- 
tively. Clark (17) found that, altho specific syndromes on the Minnesota 
Multiphasic Personality Inventory do not have value in predicting grades, 
groups of items can be selected which separate achievers from non- 
achievers. Kimball (68) used a sentence-completion technic to differ- 
entiate underachievers from other students. She found evidence of signifi- 
cantly higher proportions of negative feelings toward fathers, guilt, and 
anxiety among the underachievers. Kuhlen and Collister (70) found that 
pupils who failed to complete high school were less acceptable to their 
classmates in Grades VI and IX. 

Another type of evidence which is emphasized in academic prediction 
is the information available about the student’s previous history and 
biography. Malloy (85, 86) reported that selected data from the individ- 
ual’s biography and personal history improve the prediction of college 
grades. He also used the same data to draw generalizations about the 
attitudes and perceptions of students in relation to academic achieve- 
ment. Myers (90) also found a weighted biographical score of some value 
in predicting college freshman grades. McQuary (83), using factor 
analysis, also found background data of value in predicting academic 
achievement. 

Before closing this section, the writers must confess their feelings of 
frustration in attempting to summarize the prediction research which is 
continually being done. We find evidence of similar feelings in the an- 
nounced policy of Educational and Psychological Measurement to publish 
brief outlines of prediction research rather than complete articles. Travers 
(114) offered a possible solution in his proposal of a tentative theory to 
guide research on the prediction problem. He suggested a series of
categories to encompass the range of variables that must be considered in order
to predict achievement and the types of research needed. A model such 
as this would permit coordination of the many disparate studies now being 
made and would, if successful, result in many economies as well as in an 
improved understanding of this very challenging research area. 


Educational Placement and Diagnosis 


Closely related to the use of tests in predicting achievement is their 
use as a basis for determining the instructional needs of students, includ- 
ing determinations of specific strengths and weaknesses which should be 
taken into consideration in planning and organizing an educational pro- 
gram. 

One of the earliest uses of examinations for assigning college credit 
was the system developed by the University of Buffalo for enabling high- 
school graduates to take college examinations before they began their 
college work. Some experiences with this program were described by 
Wagner (122). Detchen (23) discussed a more recently developed sys- 
tem of exemption examinations at the Pennsylvania College for Women. 
Lorge and Diamond (79) reported a limited use of placement tests to 
determine the appropriate English course for foreign students. 

At a more general level are several articles on the use of tests for 
educational diagnosis. Traxler (115) considered in some detail the ways 
in which tests can be used as a basis for different instructional procedures 
and suggested that tests be used to check on the effectiveness of instruction. 
Dressel (26) also stressed the place of testing in appraising the changes 
in students produced by instruction and in understanding the significance 
and adequacy of the individual’s behavior before and after instruction. 
Detchen (22) questioned the value of testing unless it brings clear benefits
to the individual student and suggested some of the ways in which stu- 
dents can be helped by tests and test data. 

The Tests of General Educational Development have been very widely 
used for the purposes of educational placement. These tests have been 
administered to several million individuals. A summary of the evidence 
on the validity of these tests and their use in colleges and industry was 
prepared by Tyler (120). The use of tests for appropriate placement of 
high-school graduates in college curriculums is likely to be further 
stimulated by the placement tests being developed by the College Entrance 
Examination Board (15). 


Scoring and Normative Problems 


Altho prediction studies implicitly assume that a school grade is a 
valid criterion of achievement and that an average of school grades is 
an even more valid criterion, studies of grades and grading standards 
have shown over and over the unreliability of teachers’ grades and some of 
the sources of this unreliability. Several recent papers are pertinent here.
Hadley (57) demonstrated that final average grades are related to the 
liking or acceptance of the student. Using achievement test scores, average 
grades, and teachers’ ratings of students as likable or acceptable, he found 
that 50 percent of the least liked students received grades lower than their 
measured attainment whereas 50 percent of the most liked students re- 
ceived grades higher than their measured attainment. Symonds (109) dis- 
cussed the effect of personal needs of teachers on their evaluation of 
pupils and the ways in which these distortions reduce the validity of 
teachers’ grades. Torgerson and Green (112) found that readers of 
English papers can be factor analyzed into groups on the basis of the 
types of judgments they make, suggesting that variations among judges, 
as well as in the students and their products, influence the grading process. 

Teachers’ grades involve both evidence and a judgment of quality 
based on the evidence. Since both are combined in an undifferentiated 
index, the distinction between the evidence used and the judgment of 
quality is not clear. However, when the evidence is in the form of a test 
or test results, it is possible to deal more clearly with the process of 
judgment of quality. Several writers have focused on the ways in which 
such judgments are or should be made. Nedelsky (92) proposed a technic 
for having instructors judge (in advance of actually giving the test) the 
levels of performance to be assigned different grade levels. His technic is 
one of making judgments on each item and then combining the judgment 
for each instructor as well as for a group of instructors. Lorge and Lorraine 
Kruglov Diamond (78, 80, 81, 82) presented research on the effectiveness 
of judges in estimating the relative as well as the absolute difficulty of 
arithmetic items. In one study (82) it was discovered that providing the 
judges with empirical evidence on a sample of items made for greater 
consistency among the judges, but did not improve estimates of relative 
and absolute item difficulty. However, in a later study (81), using ex- 
perienced teachers of high-school mathematics as judges, it was found 
that when they were given information concerning the item difficulties 
of a sample of items, estimates of absolute item difficulty were better. 
In another study (80), this additional information improved the estimates 
of absolute and relative item difficulty of the poorer judges. Estimates 
of absolute item difficulties can be predicted better by making use of 
the average rank order given by the judges rather than estimates of the 
percent passing each item (78). 
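
An item-by-item judging procedure of the kind attributed above to Nedelsky (92) can be sketched as follows. The specific rule used here is an assumption, being the form in which the procedure is usually described elsewhere rather than anything stated in this chapter: each judge marks the options of a multiple-choice item that a barely passing student should be able to reject, and the expected score on the item is taken as the reciprocal of the number of options remaining. All names are hypothetical.

```python
def item_expectation(n_options, n_rejectable):
    """Expected score of a barely passing student on one item.

    n_rejectable is the number of wrong options the judge believes such
    a student can rule out; the student is assumed to choose at random
    among the options that remain (assumption of this sketch).
    """
    if not 0 <= n_rejectable < n_options:
        raise ValueError("rejectable options must leave at least the right answer")
    return 1.0 / (n_options - n_rejectable)


def judge_standard(items):
    """Passing score implied by one judge; items is a list of
    (n_options, n_rejectable) pairs, one pair per test item."""
    return sum(item_expectation(k, e) for k, e in items)


def group_standard(judgments):
    """Average of the passing scores implied by several judges."""
    if not judgments:
        raise ValueError("at least one judge is required")
    standards = [judge_standard(items) for items in judgments]
    return sum(standards) / len(standards)
```

The judgments are made item by item, summed for each instructor, and then averaged over the group of instructors, all before the test is given.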

Another problem encountered in the scoring of achievement tests is one 
of the unit to be used. What is to be done about wrong answers in rela- 
tion to right answers? Are all right answers to be given equal value? 
How shall rate of response be related to accuracy of response? Nedelsky 
(91) reported that the ability to make right responses and the ability to 
avoid gross or crude errors are relatively independent. He suggested some 
methods for scoring the number of gross errors and relating it to the 
number of correct responses in the final scoring procedure. Coombs (21) 
proposed a procedure in which examinees cross out all answers they
consider wrong, but do not guess among the remaining answers. Swineford
and Miller (108) experimented with instructions encouraging students 
to guess as contrasted with instructions not to guess and found that the 
amount of guessing varied with the instructions given. They found little 
relationship between ability and tendency to guess. Jackson (61) found 
the usual high relationship between the number of right responses and 
the score involving correction for guessing. Sherriffs and Boomer (105) 
used the Minnesota Multiphasic Personality Inventory to study the
characteristics of students in relation to scoring methods. They found that
students with MMPI scores indicative of introversion, low self-esteem, and
undue concern with the impression they make on others are penalized 
most by the right-minus-wrong scoring method since they tend to omit 
more items and to omit items for which they know the answer. However, 
Keislar (66) found that, if the test instructions state how the test is to 
be scored but do not stress guessing, the omission of responses by individ- 
uals with certain personality traits is reduced. 
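
The correction for guessing discussed in this paragraph can be illustrated with the conventional formula, in which the number of wrong answers, divided by one less than the number of options per item, is subtracted from the number of right answers. For two-option items the formula reduces to right-minus-wrong scoring. This is a minimal sketch with a hypothetical function name; it assumes that every item has the same number of options and that omitted items are neither credited nor penalized.

```python
def corrected_score(n_right, n_wrong, n_options):
    """Score corrected for guessing: R - W / (k - 1).

    n_options is the number of choices per item (k); omitted items do
    not enter the score.
    """
    if n_options < 2:
        raise ValueError("an item must offer at least two options")
    if n_right < 0 or n_wrong < 0:
        raise ValueError("counts of right and wrong answers cannot be negative")
    return n_right - n_wrong / (n_options - 1)
```

Under this formula a student who omits an item to which he knows the answer loses a full point, whereas a wrong guess on a five-option item costs only one-fourth of a point. This is the sense in which students inclined to omit are penalized.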

The effect of rate on test performance is another important
consideration. Ebel (36) recommended that, if a rate-free measure is required,
enough items should be included to keep all examinees at work for the 
whole test period, but that the scores should be expressed as the relation 
of the correct answers to the number of items reached. In another paper,
Ebel (34) found little relationship between the rate of response on college 
aptitude tests and accuracy of responses. He also reported that the rate 
scores have little value in predicting academic success. In a third paper 
on rate of response, Ebel (39) found that in certain situations, when tests 
of equal time length are used, a test made up of items which require a 
short response time is more valid than a test made up of fewer items re- 
quiring a longer response time. The criterion was a long test of the same 
general type. Frederiksen (46) investigated the effect of separate timing 
of the parts of the Cooperative Reading Test as contrasted with an over-all 
time for the entire test and the use of instructions which direct the exam- 
inees to read the questions before referring to the relevant reading passages. 
He found no significant differences. This study suggests that complicated 
timing and complex directions for administering tests should be subjected 
to experimental trial to determine their usefulness and relevance. 
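
The rate-free measure attributed above to Ebel (36), the relation of correct answers to items reached, can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that the number of items reached is recorded for each examinee.

```python
def rate_free_score(n_correct, n_reached):
    """Proportion of reached items answered correctly.

    Two examinees who work at different speeds but with equal accuracy
    receive the same score.
    """
    if n_reached <= 0:
        raise ValueError("the examinee must reach at least one item")
    if not 0 <= n_correct <= n_reached:
        raise ValueError("correct answers cannot exceed items reached")
    return n_correct / n_reached
```

An examinee with 30 right of 40 items reached and another with 45 right of 60 reached both receive .75, altho their number-right scores differ by 15 points.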

The method of scaling, types of norms that are appropriate, and methods 
of combining scores are also important matters to be considered in the 
treatment of test data. Findley (44) recommended the development of 
scales which can be used to report growth in terms of the feats the in- 
dividual can perform and which are independent of present practices in 
teaching school subjects or organization of instruction by grades. Manuel 
(87) stressed the desirability of developing a set of tests which will give 
comparable scores on different languages in order to provide a means for 
relating the level of achievement in a foreign language to the examinee’s 
level of achievement in his native language. Durost and Prescott (32) 
proposed a measure which may be used in comparing capacity with
achievement and outlined a method to meet these specifications. Clark (18) 
proposed a method of relating the results of a group’s achievement test 
scores to the expected level based on the group’s median on a test of 
mental ability. 

The whole problem of what is measured by intelligence and achieve- 
ment tests was raised again by Coleman and Cureton (20) who reported 
a 95-percent overlap between the results of a group intelligence test and 
three subtests of an achievement test battery. The group intelligence test 
and the achievement test thus measure similar functions. The authors were 
of the opinion that, contrary to popular belief, achievement tests may 
actually provide more information concerning differences in native capacity 
than do group intelligence tests. 


Analyses of Student Responses to Tests 


Test workers have frequently limited their analyses of tests to the units 
provided by scores or subscores. More recently, some testers have found it 
desirable to study the student’s responses to individual items or the charac- 
teristics of the student’s responses to selected groups and patterns of items. 

Rapaport and Berg (98) found that response sets or biases to respond 
to certain positions in a multiple-choice test are characteristic of some 
individuals rather than of the group, and are likely to appear when a 
student cannot base his choice on knowledge or on the formal charac- 
teristics of the item. Norman (95) discovered that students get higher 
scores on a multiple-choice psychology test when the items are arranged 
in the order they were presented in the textbook than when they are 
arranged in the reverse order. This effect was not found for true-false 
items. Gaier, Lee, and McQuitty (50) concluded that consistent response 
sets or methods of reasoning could be isolated in several different types 
of logical inference items. 

Leichty (73) investigated multiple-choice items which did not stand 
up under conventional item analysis. By having students think aloud as 
they answered the questions, he was able to show for these items that 
the students attempted to recall specific information they had learned 
rather than to organize their knowledge into generalizations applicable to 
new situations. Jones (63) found a discrepancy between students’ selection 
of correct answers in an objective test and their justifications for their 
choices. He claimed that it is important to consider not only the student’s 
conclusion on an item, but also his process of arriving at it. 








Problems of Communication among Test Specialists, 
Test Users, and Students 


Altho achievement tests may be used as instruments for basic research 
on education and learning, it is much more common to find them used 
for some specific and practical problem in connection with the schools.
It is in this connection that problems arise in the judgment of the adequacy 
and relevance of specific tests, in the application of the results to a
specific problem, and in the interpretation of results by teachers as well
as by students. 

Dyer and King (33) prepared a manual which should be of value to 
users of the College Entrance Examination Board’s tests in determining 
the appropriateness of tests, the purposes for which these tests may be used 
by the schools, and the ways in which the test scores are reported and
interpreted. Berdie (5) suggested that more test data are being supplied
to test users than they are capable of using and suggested that the remedy 
is in more training for the test user and more involvement of the test 
user in the testing process. Lennon (74) stressed the point that
well-constructed test manuals can provide an effective medium of
communication between test maker and test user.

Engelhart (42) reported ways in which cooperation between teachers 
and testers in test construction can be valuable in improving both
instruction and evaluation. Ebel (35, 37, 38) discussed some of the
misconceptions test users have about the value of tests, the nature of tests, and
the meaning of test scores. He described the ways in which a college 
examination service can help college teachers give better tests and the 
procedures which can be used by a test service in analyzing and reporting 
the characteristics of a test to college teachers. Lennon (75) was concerned 
about the possibility that the taking of tests may be an upsetting experience 
for pupils and may be inimical to the relationship between pupil and 
teacher. He suggested that, if both teachers and pupils recognize that they 
are working toward common goals, the pupils will develop more positive 
attitudes toward tests and test results. 


Needs and Trends 


Findley (43) reviewed some of the major trends in achievement test- 
ing. Several of the very significant trends are: (a) Tests are increasingly 
used to measure growth and development rather than status at a particular 
time. (b) Achievement testing is increasingly requiring the student to 
apply his knowledge in novel situations. (c) Increasing use is being made 
of batteries of tests rather than tests in single subjects. (d) Greater 
efforts are being made to evaluate problem-solving processes and technics 
thru the use of test items in which the sequence of answers may be as 
important as their accuracy. (e) Recognition of the differential guidance 
value of quantitative, verbal, and other types of tests is increasing. (f)
Interest in the development of scales and units which are independent of
age and grade norms is increasing. Among trends pointed out by Durost 
(30), many of which overlap those mentioned by Findley, are the shift 
of interest from statistical validity of tests toward curriculum validity and 
the trend toward differentiated measures of intelligence.

Durost (31) found several weaknesses in present school evaluation programs,
particularly inadequate training of teachers in evaluation and
measurement technics, and stressed the need for increased budgetary pro- 
visions and adequate systems of record keeping. He recommended the 
formation of a committee to stress the types of training needed by teachers 
in this area and to recommend and devise materials to improve the in- 
service education of teachers. In a study of college admissions pro- 
cedures, Traxler and Townsend (116) reported that a wider use of 
standardized tests and cumulative records has increased the reliability 
of the quantitative data furnished to the colleges by the secondary schools. 


Bibliography 


1. ABELSON, ROBERT P. “Sex Differences in Predictability of College Grades.”
Educational and Psychological Measurement 12: 638-44; Winter 1952.

2. ADAMS, SAM, and GARRETT, H. L. “Scholastic Background as Related to Success
in College Physics.” Journal of Educational Research 47: 545-49; March 1954.

3. ADCOCK, C. J. Intelligence and High Level Achievement. Victoria University
College Publications in Psychology. Wellington, New Zealand: Victoria
University, 1952. 27 p.

4. BENDIG, A. W., and HUGHES, J. B. “Student Attitude and Achievement in a
Course in Introductory Statistics.” Journal of Educational Psychology 45:
268-76; May 1954.

5. BERDIE, RALPH F. “Bringing National and Regional Testing Programs into
Local Schools.” Proceedings, 1953 Invitational Conference on Testing
Problems. Princeton, N. J.: Educational Testing Service, 1953. p. 78-83.

6. BIRD, DONALD E. “Teaching Listening Comprehension.” Journal of
Communication 3: 127-30; November 1953.

7. BLOOM, BENJAMIN S., and WARD, F. CHAMPION. “The Chicago Bachelor of
Arts Degree after Ten Years.” Journal of Higher Education 23: 459-67;
December 1952.

8. BLOOM, BENJAMIN S., and OTHERS. Taxonomy of Educational Objectives.
Preliminary edition. New York: Longmans, Green and Co., 1954. 192 p.

9. BORDIN, EDWARD S., and WILSON, EARL H. “Change of Interest as a Function
of Shift in Curricular Orientation.” Educational and Psychological
Measurement 13: 297-307; Summer 1953.

10. BROOKS, MELVIN S., and WEYNAND, ROBERT S. “Interest Preferences and
Their Effect upon Academic Success.” Social Forces 32: 281-85; March 1954.

11. BROWN, STANLEY B. “Science Information and Attitudes Possessed by California
Elementary School Pupils.” Journal of Educational Research 47: 551-54;
March 1954.

12. BURKE, NORRIS F., and ANDERSON, KENNETH E. “A Comparative Study of
1939 and 1950 Achievement Test Results in the Hawthorne Elementary
School in Ottawa, Kansas.” Journal of Educational Research 47: 19-33;
September 1953.

13. CARROLL, JOHN B. “Some Principles of Language Testing.” Report of the Fourth
Annual Round Table Meeting on Linguistics and Language Teaching.
Monograph Series on Languages and Linguistics, No. 4. Washington, D. C.:
Georgetown University, School of Foreign Service, Institute of Languages and
Linguistics, 1953. p. 6-10.

14. CHAUNCEY, HENRY. Annual Report to the Board of Trustees, 1952-53. Princeton,
N. J.: Educational Testing Service, 1953. p. 24-27.

15. CHAUNCEY, HENRY. Annual Report to the Board of Trustees, 1953-54. Princeton,
N. J.: Educational Testing Service, 1954. p. 39-43.

16. CLARK, BRANT, and JOHNSON, WOODBURY. The Tests of General Educational
Development (College Level) as Predictors of Performance in the U. S. Naval
School, Pre-Flight. Naval School of Aviation Medicine Research Report, Project
No. NM 001 059. 16.01. Pensacola, Fla.: U. S. Naval School of Aviation
Medicine, 1952. 3 p.


REVIEW OF EDUCATIONAL RESEARCH Vol. XXVI, No. 1








17. CLARK, J. H. “Grade Achievement of Female College Students in Relation to
Non-Intellective Factors: MMPI Items.” Journal of Social Psychology 37:
275-81; May 1953.

18. CLARK, WILLIS W. “Evaluating School Achievement in Basic Skills in
Relation to Mental Ability.” Journal of Educational Research 46: 179-91;
November 1952.

19. COLEMAN, WILLIAM. “An Economical Test Battery for Predicting Freshman
Engineering Course Grades.” Journal of Applied Psychology 37: 465-67;
December 1953.

20. COLEMAN, WILLIAM, and CURETON, EDWARD E. “Intelligence and Achievement:
The Jangle Fallacy Again.” Educational and Psychological Measurement 14:
347-51; Summer 1954.

21. COOMBS, CLYDE H. “On the Use of Objective Examinations.” Educational and
Psychological Measurement 13: 308-10; Summer 1953.

22. DETCHEN, LILY. “The Evaluation Dividend for the Individual Student.”
Proceedings, 1953 Invitational Conference on Testing Problems. Princeton,
N. J.: Educational Testing Service, 1953. p. 17-22.

23. DETCHEN, LILY. “A Program of Required Exemption Examinations.” Journal
of Higher Education 24: 249-53; May 1953.

24. DOW, CLYDE W. The Development of Listening Comprehension Tests for Michigan
State College Freshmen. Doctor’s thesis. East Lansing: Michigan State College,
1952. 265 p. Abstract: Dissertation Abstracts 13: 268-69; No. 2, 1953.

25. DOWNIE, NORVILLE M., and BELL, C. R. “The Minnesota Teacher Attitude
Inventory as an Aid in the Selection of Teachers.” Journal of Educational
Research 46: 699-704; May 1953.

26. DRESSEL, PAUL L. “Evaluation as Instruction.” Proceedings, 1953 Invitational
Conference on Testing Problems. Princeton, N. J.: Educational Testing Service,
1953. p. 23-34.

27. DRESSEL, PAUL L., and MAYHEW, LEWIS B. General Education: Explorations in
Evaluation. Washington, D. C.: American Council on Education, 1954. 302 p.

28. DRESSEL, PAUL L., and SCHMID, JOHN. “Some Modifications of the
Multiple-Choice Items.” Educational and Psychological Measurement 13: 574-95;
Winter 1953.

29. DRESSEL, PAUL; SCHMID, JOHN; and KINCAID, GERALD. “The Effect of Writing
Frequency upon Essay-Type Writing Proficiency at the College Level.” Journal
of Educational Research 46: 285-93; December 1952.

30. DUROST, WALTER N. “Modern Trends in Testing and Guidance.” Modern
Educational Problems. Report of the Seventeenth Educational Conference held
under the auspices of the Educational Records Bureau and the American Council
on Education. Washington, D. C.: American Council on Education, 1953.
p. 111-19.

31. DUROST, WALTER N. “Present Progress and Needed Improvements in School
Evaluation Programs.” Educational and Psychological Measurement 14: 247-54;
Summer 1954.

32. DUROST, WALTER N., and PRESCOTT, GEORGE A. “An Improved Method of
Comparing a Capacity Measure with an Achievement Measure at the Elementary
School Level.” Educational and Psychological Measurement 12: 741-55; Winter
1952.

33. DYER, HENRY S., and KING, RICHARD G. College Board Scores; Their Use and
Interpretation. Princeton, N. J.: College Entrance Examination Board, 1955.
192 p.

34. EBEL, ROBERT L. “The Characteristics and Usefulness of Rate Scores on College
Aptitude Tests.” Educational and Psychological Measurement 14: 20-28; Spring
1954.

35. EBEL, ROBERT L. “How an Examination Service Helps College Teachers To Give
Better Tests.” Proceedings, 1953 Invitational Conference on Testing Problems.
Princeton, N. J.: Educational Testing Service, 1953. p. 3-16.

36. EBEL, ROBERT L. “Maximizing Test Validity in Fixed Time Limits.” Educational
and Psychological Measurement 13: 347-57; Summer 1953.

37. EBEL, ROBERT L. “Problems of Communication Between Test Specialists and
Test Users.” Educational and Psychological Measurement 14: 277-82; Summer
1954.

38. EBEL, ROBERT L. “Procedures for the Analysis of Classroom Tests.” Educational
and Psychological Measurement 14: 352-64; Summer 1954.


February 1956 TESTS OF EDUCATIONAL ACHIEVEMENT





39. EBEL, ROBERT L. “The Use of Item Response Time Measurements in the
Construction of Educational Achievement Tests.” Educational and Psychological
Measurement 13: 391-401; Autumn 1953.

40. EDUCATIONAL TESTING SERVICE. Institutional Testing Program: Handbook for
Deans and Examiners, 1955-56, Graduate Record Examinations. Princeton,
N. J.: the Service, 1955. p. 3-6.

41. ELEY, EARLE G. “The Test Satisfies an Educational Need.” College Board
Review, Winter 1955. p. 9-13.

42. ENGELHART, MAX D. “Making Testing Meaningful to Teachers Through Local
Test Construction and Analysis of Test Data.” Proceedings, 1953 Invitational
Conference on Testing Problems. Princeton, N. J.: Educational Testing Service,
1953. p. 84-89.

43. FINDLEY, WARREN G. “Progress in the Measurement of Achievement.”
Educational and Psychological Measurement 14: 255-60; Summer 1954.

44. FINDLEY, WARREN G. “Studying the Individual Through the School’s Testing.”
Modern Educational Problems. Report of the Seventeenth Educational
Conference held under the auspices of the Educational Records Bureau and the
American Council on Education. Washington, D. C.: American Council
on Education, 1953. p. 38-47.

45. FRANDSEN, ARDEN N., and SESSIONS, ALWYN D. “Interests and School
Achievement.” Educational and Psychological Measurement 13: 94-101; Spring
1953.

46. FREDERIKSEN, NORMAN. “The Influence of Timing and Instructions on
Cooperative Reading Test Scores.” Educational and Psychological Measurement
12: 598-607; Winter 1952.

47. FREDERIKSEN, NORMAN, and MELVILLE, S. D. “Differential Predictability in the
Use of Test Scores.” Educational and Psychological Measurement 14: 647-56;
Winter 1954.

48. FREDERIKSEN, NORMAN, and SATTER, GEORGE. “The Construction and
Validation of an Arithmetical Computation Test.” Educational and Psychological
Measurement 13: 209-27; Summer 1953.

49. FRIEDENBERG, EDGAR Z. “The Measurement of the Insight of Graduate Students
into the Methods of the Social Sciences.” Educational and Psychological
Measurement 12: 350-67; Autumn 1952.

50. GAIER, EUGENE L.; LEE, MARILYN C.; and McQUITTY, LOUIS L. “Response
Patterns in a Test of Logical Inference.” Educational and Psychological
Measurement 13: 550-67; Winter 1953.

51. GILLIES, DUNCAN V., and LASTRUCCI, CARLO L. “Validation of the Effectiveness
of a College Marriage Course.” Marriage and Family Living 16: 55-58; February
1954.

52. GIVENS, PAUL R. “Kuder Patterns of Interest as Related to Achievement in
College Science Courses.” Journal of Educational Research 46: 627-30; April
1953.

53. GLASER, ROBERT; DAMRIN, DORA E.; and GARDNER, FLOYD M. “The Tab
Item: A Technique for the Measurement of Proficiency in Diagnostic
Problem-Solving Tasks.” Educational and Psychological Measurement 14: 283-93;
Summer 1954.

54. GOUGH, HARRISON G. “The Construction of a Personality Scale To Predict
Scholastic Achievement.” Journal of Applied Psychology 37: 361-66; October
1953.

55. GREENE, HARRY A.; JORGENSEN, ALBERT N.; and GERBERICH, J. RAYMOND.
Measurement and Evaluation in the Elementary School. Second edition. New
York: Longmans, Green and Co., 1953. 617 p.

56. GUSTAD, JOHN W. “A Longitudinal Study of Social Behavior Variables in College
Students.” Educational and Psychological Measurement 12: 226-35; Summer
1952.

57. HADLEY, S. TREVOR. “A School Mark: Fact or Fancy?” Educational
Administration and Supervision 40: 305-12; May 1954.

58. HAYES, MARGARET L., and CONKLIN, MARY E. “Intergroup Attitudes and
Experimental Change.” Journal of Experimental Education 22: 19-36; September
1953.

59. HOERRES, MARY A., and O’DEA, J. DAVID. “Predictive Value of the A.C.E.”
Journal of Higher Education 25: 97; February 1954.





60. HUDDLESTON, EDITH M. “Measurement of Writing Ability at the
College-Entrance Level: Objective vs. Subjective Testing Techniques.” Journal
of Experimental Education 22: 165-213; March 1954.

61. JACKSON, ROBERT A. “Guessing and Test Performance.” Educational and
Psychological Measurement 15: 74-79; Spring 1955.

62. JENSON, RALPH E. “Predicting Scholastic Achievement of First-Year Graduate
Students.” Educational and Psychological Measurement 13: 322-29; Summer
1953.

63. JONES, STEWART. “Process-Testing—An Attempt To Analyze Reasons for
Students’ Responses to Test Questions.” Journal of Educational Research 46:
525-34; March 1953.

64. JORDAN, ARTHUR M. Measurement in Education. New York: McGraw-Hill Book
Co., 1953. 533 p.

65. KEARNEY, NOLAN C. Elementary School Objectives. New York: Russell Sage
Foundation, 1953. 189 p.

66. KEISLAR, EVAN R. “Test Instructions and Scoring Method in True-False Tests.”
Journal of Experimental Education 21: 243-49; March 1953.

67. KELLEY, HAROLD, and PEPITONE, ALBERT. “An Evaluation of a College Course in
Human Relations.” Journal of Educational Psychology 43: 193-209; April 1952.

68. KIMBALL, BARBARA. “The Sentence-Completion Technique in a Study of
Scholastic Underachievement.” Journal of Consulting Psychology 16: 353-58;
October 1952.

69. KRATHWOHL, WILLIAM C. “Relative Contributions of Aptitude and Work Habits
to Achievement in College Mathematics.” Journal of Educational Psychology
44: 140-48; March 1953.

70. KUHLEN, RAYMOND G., and COLLISTER, E. GORDON. “Sociometric Status of
Sixth- and Ninth-Graders Who Fail To Finish High-School.” Educational and
Psychological Measurement 12: 632-37; Winter 1952.

71. LADO, ROBERT. “Test the Language.” Report of the Fourth Annual Round Table
Meeting on Linguistics and Language Teaching. Monograph Series on
Languages and Linguistics, No. 4. Washington, D. C.: Georgetown University,
School of Foreign Service, Institute of Languages and Linguistics, 1953.
p. 29-33.

72. LANNHOLM, GERALD V. “Educational Growth During the Second Two Years of
College.” Educational and Psychological Measurement 12: 645-53; Winter 1952.

73. LEICHTY, VERDUN E. “What Makes a Test Item Bad?” Journal of Educational
Research 48: 115-21; October 1954.

74. LENNON, ROGER T. “The Test Manual as a Medium of Communication.”
Proceedings, 1953 Invitational Conference on Testing Problems. Princeton,
N. J.: Educational Testing Service, 1953. p. 90-94.

75. LENNON, ROGER T. “Testing: Bond or Barrier Between Pupil and Teacher?”
Education 75: 38-42; September 1954.

76. LINDGREN, HENRY C. “The Incomplete Sentences Test as a Means of Course
Evaluation.” Educational and Psychological Measurement 12: 217-25; Summer
1952.

77. LORGE, IRVING. “Individual Versus Group Decision Making.” Proceedings,
1953 Invitational Conference on Testing Problems. Princeton, N. J.: Educational
Testing Service, 1953. p. 36-42.

78. LORGE, IRVING, and DIAMOND, LORRAINE K. “The Prediction of Absolute
Item Difficulty by Ranking and Estimating Techniques.” Educational and
Psychological Measurement 14: 365-72; Summer 1954.

79. LORGE, IRVING, and DIAMOND, LORRAINE K. “Validity of an may Examination
for the Placement of Foreign Students in English Courses.” Journal
of Educational Psychology 45: 208-14; April 1954.

80. LORGE, IRVING, and DIAMOND, LORRAINE K. “The Value of Information to
Good and Poor Judges of Item Difficulty.” Educational and Psychological
Measurement 14: 29-33; Spring 1954.

81. LORGE, IRVING, and KRUGLOV, LORRAINE. “The Improvement of Estimates of
Test Difficulty.” Educational and Psychological Measurement 13: 34-46; Spring
1953.

82. LORGE, IRVING, and KRUGLOV, LORRAINE. “A Suggested Technique for the
Improvement of Difficulty Prediction of Test Items.” Educational and
Psychological Measurement 12: 554-61; Winter 1952.
















83. McQUARY, JOHN P. “Some Relationships Between Non-Intellectual
Characteristics and Academic Achievement.” Journal of Educational Psychology
44: 215-28; April 1953.

84. MAIZE, RAY C. “Two Methods of Teaching English Composition to Retarded
College Freshmen.” Journal of Educational Psychology 45: 22-28; January
1954.

85. MALLOY, JOHN. “An Investigation of Scholastic Over- and Under-Achievement
among Female College Freshmen.” Journal of Counseling Psychology 1: 260-63;
Winter 1954.

86. MALLOY, JOHN. “The Prediction of College Achievement with the Life
Experience Inventory.” Educational and Psychological Measurement 15: 170-80;
Summer 1955.

87. MANUEL, HERSCHEL T. “The Use of Parallel Tests in the Study of Foreign
Language Teaching.” Educational and Psychological Measurement 13: 431-36;
Autumn 1953.

88. MICHAEL, WILLIAM B., and REEDER, DOUGLAS E. “The Development and
Validation of a Preliminary Form of a Study-Habits Inventory.” Educational
and Psychological Measurement 12: 236-47; Summer 1952.

89. MYERS, CHARLES T. “A Note on a Spatial Relations Pretest and Posttest.”
Educational and Psychological Measurement 13: 596-600; Winter 1953.

90. MYERS, ROBERT C. “Biographical Factors and Academic Achievement: An
Experimental Investigation.” Educational and Psychological Measurement 12:
415-26; Autumn 1952.

91. NEDELSKY, LEO. “Ability To Avoid Gross Error as a Measure of Achievement.”
Educational and Psychological Measurement 14: 459-72; Autumn 1954.

92. NEDELSKY, LEO. “Absolute Grading Standards for Objective Tests.” Educational
and Psychological Measurement 14: 1-19; Spring 1954.

93. NELSON, ERLAND N. P. Persistence of Attitudes of College Students Fourteen
Years Later. Psychological Monographs, No. 373. Washington, D. C.:
American Psychological Association, 1954. 13 p.

94. NEWMAN, SIDNEY H.; FRENCH, JOHN W.; and BOBBITT, JOSEPH M. “Analysis
of Criteria for the Validation of Selection Measures at the United States
Coast Guard Academy.” Educational and Psychological Measurement 12:
394-407; Autumn 1952.

95. NORMAN, RALPH D. “The Effects of a Forward-Retention Set on an Objective
Achievement Test Presented Forwards or Backwards.” Educational and
Psychological Measurement 14: 487-98; Autumn 1954.

96. OSBORNE, R. TRAVIS, and SANDERS, WILMA B. “Comparative Decline of
Graduate Record Examination Scores and Intelligence with Age.” Journal
of Educational Psychology 45: 353-58; October 1954.

97. PEARSON, RICHARD. “The Test Fails as an Entrance Examination.” College
Board Review, No. 25, Winter 1955. p. 2-9.

98. RAPAPORT, GERALD M., and BERG, IRWIN A. “Response Sets in a
Multiple-Choice Test.” Educational and Psychological Measurement 15: 58-62;
Spring 1955.

99. REMMERS, H. H., and GAGE, NATHANIEL L. Educational Measurement and
Evaluation. Revised edition. New York: Harper and Brothers, 1955. 650 p.

100. REYNOLDS, PAUL E. “Examinations—and Examinations.” Journal of Higher
Education 25: 38-40; January 1954.

101. ROSS, CLAY C. Measurement in Today’s Schools. Third edition, revised by
Julian C. Stanley. New York: Prentice-Hall, 1954. 485 p.

102. RUST, RALPH M., and RYAN, F. J. “The Strong Vocational Interest Blank
and College Achievement.” Journal of Applied Psychology 38: 341-45; October
1954.

103. SCHULTZ, DOUGLAS G., and GREEN, BERT F., JR. “Predicting Academic
Achievement with a New Attitude-Interest Questionnaire—II.” Educational and
Psychological Measurement 13: 54-64; Spring 1953.

104. SCHWELLENBACH, JOHN A. “An Experiment in Predicting the Ability of Eighth
Grade Students To Work Simple Algebra Problems.” California Journal of
Educational Research 5: 36-41; January 1954.

105. SHERRIFFS, ALEX C., and BOOMER, DONALD S. “Who Is Penalized by the Penalty
for Guessing?” Journal of Educational Psychology 45: 81-90; February 1954.


106. SHORES, J. HARLAN, and SAUPE, J. L. “Reading for Problem-Solving in Science.”
Journal of Educational Psychology 44: 149-59; March 1953.

107. SMITH, DONALD E., and GLOCK, MARVIN D. “Measuring Knowledge and
Application: An Experimental Investigation.” Journal of Experimental
Education 21: 327-31; June 1953.

108. SWINEFORD, FRANCES, and MILLER, PETER M. “Effects of Directions Regarding
Guessing on Item Statistics of a Multiple-Choice Vocabulary Test.” Journal
of Educational Psychology 44: 129-39; March 1953.

109. SYMONDS, PERCIVAL M. “Pupil Evaluation and Self-Evaluation.” Teachers
College Record 54: 138-49; December 1952.

110. THORPE, LOUIS P.; LEFEVER, D. WELTY; and NASLUND, ROBERT A. SRA
Achievement Series. Chicago: Science Research Associates.

111. TIEDEMAN, DAVID V., and BRYAN, JOSEPH G. “Prediction of College Field
of Concentration.” Harvard Educational Review 24: 122-39; Spring 1954.

112. TORGERSON, WARREN S., and GREEN, BERT F., JR. “The Factor Analysis of
Subject-Matter Experts.” Journal of Educational Psychology 43: 354-63; October
1952.

113. TRAVERS, ROBERT M. W. Educational Measurement. New York: Macmillan Co.,
1955. 420 p.

114. TRAVERS, ROBERT M. W. An Inquiry into the Problem of Predicting Achievement.
Lackland Air Force Base, Texas: Air Force Personnel and Training Research
Center, Air Research and Development Command, 1954. 32 p.

115. TRAXLER, ARTHUR E. “The Use of Tests in Differentiated Instruction.”
Education 74: 272-78; January 1954.

116. TRAXLER, ARTHUR E., and TOWNSEND, AGATHA E., editors. Improving
Transition from School to College. New York: Harper and Brothers, 1953. 165 p.

117. TRIGGS, FRANCES O. “The Development of Measured Word Recognition Skills,
Grade Four Through the College Freshman Year.” Educational and
Psychological Measurement 12: 345-49; Autumn 1952.

118. TUCKMAN, JACOB, and LORGE, IRVING. “The Influence of a Course on the
Psychology of the Adult on Attitudes Toward Old People and Older Workers.”
Journal of Educational Psychology 43: 400-407; November 1952.

119. TUCKMAN, JACOB, and LORGE, IRVING. “The Influence of Changed Directions
on Stereotypes about Ageing; Before and after Instruction.” Educational and
Psychological Measurement 14: 128-32; Spring 1954.

120. TYLER, RALPH W. The Fact-Finding Study of the Testing Program. Madison,
Wis.: U. S. Armed Forces Institute, 1954. (Mimeo.) 304 p.

121. U. S. DEPARTMENT OF THE ARMY, THE ADJUTANT GENERAL’S OFFICE,
PERSONNEL RESEARCH BRANCH. The Development of a Sound Motion Picture
Proficiency Test. Personnel Research Branch Report No. 991. Washington, D. C.:
American Documentation Institute, Library of Congress, 1953. 96 p.

122. WAGNER, MAZIE E. “Anticipatory Examinations for College Credit: Twenty
Years Experience at the University of Buffalo.” University of Buffalo Studies
20: 107-33; 1952.

123. WEBB, SAM C. “The Validity of a Generalized Scale for Comparing Interest
in Natural Science Subjects.” Educational and Psychological Measurement
12: 472-89; Autumn 1952.

124. WILLIAMS, HENRIETTA V., and McQUARY, JOHN P. “The High-School
Performance of College Freshmen.” Educational Administration and Supervision
39: 303-308; May 1953.





CHAPTER VI 


Development of Statistical Methods Especially Useful 
in Test Construction and Evaluation 


WILLIAM B. MICHAEL 


Between August 1, 1952, and July 30, 1955, there appeared a substantial
number of developments in statistical methods that were particularly
applicable to the analysis, evaluation, and construction of tests.
Altho the recent issue of the Review, “Statistical Methodology in Educa- 
tional Research” (1), so ably prepared under the chairmanship of Helen 
Walker, constituted a comprehensive work of inestimable value to in- 
dividuals engaged in educational research, it seemed that there were a 
large number of additional significant contributions in statistical method- 
ology that were especially oriented to the analysis and evaluation of 
characteristic properties of tests. By far the majority of references that 
will be cited were not mentioned in the issue devoted to statistical method- 
ology. The few duplications were papers considered important in dealing 
with a given topic. 

Perhaps almost half of the articles to be cited are concerned either 
with new or with modified statistical procedures that can be employed in 
the analysis and/or selection of test items, or with the development of 
such computational aids as tables, charts, or graphs that effect con- 
siderable savings in time and effort required of the test technician. A great 
deal of significant work, somewhat theoretical in emphasis, appeared in 
the area of reliability estimation. In addition, much fundamental re- 
search that permits an estimation of the influence of sampling error upon 
the stability of various item and test statistics was completed. Important 
studies were published in the realm of prediction, especially in conjunc- 
tion with certain adaptations in multiple regression technics. Noteworthy 
statistical developments occurred in such diversified areas as the trans- 
formation or adjustment of scale values, the influence of format of 
test items upon test correlations, and profile and pattern analysis. 

Finally, a number of empirical investigations yielded evidence con- 
cerning the effectiveness of various statistical procedures when they are 
applied to the analysis and evaluation of item and test data. Only those 
empirical studies have been included that in terms of their design and 
scope seem to offer significant evidence regarding the applicability of 
various statistical technics to problems of test construction and evaluation. 


New or Modified Procedures for Item Selection 


A number of somewhat different statistical approaches to the selection of
test items appeared, representing either new technics or substantial
modifications of fairly well-known ones. Their purposes were to maximize
test reliability and/or validity or to find groups of homogeneous items.
Altho in many of the developments to be described, moderate
to substantial savings might be effected with respect to the computational 
labor involved, the intended emphasis will rest primarily upon the 
methodological importance of the procedure in item and test analysis 
and secondarily upon computational savings effected. In the next section 
the principal attention will be given to the development of economical 
methods of estimation of item and test statistics. Obviously, however,
there is a considerable amount of overlap in that the same contributions
may be both methodologically and computationally useful.

In his discussion of the selection of a set of items from a larger pool 
of items so that the selection composite will have a maximum degree of 
validity with an external criterion, Green (28) compared formulations 
by Gulliksen and Horst, and cited conditions under which Gulliksen’s 
and Horst’s indexes of validity either overestimate or underestimate the 
extent of the correlation between the selected group of items and the 
criterion variable. For any fixed ratio of the standard deviation of
subtest u, made up of the unselected (discarded) items, to the standard
deviation of subtest s, composed of selected items, Green was able to find
a critical value of the correlation ($r_{su}$) between the selected and
unselected subtests. When the magnitude of $r_{su}$ exceeds the critical
value, Gulliksen’s index is the superior approximation to the correlation
between the criterion and the selected subtest. However, Horst’s index is
better for values of $r_{su}$ below the critical point. Caution is suggested
in application of the procedure when the ratio is fairly large.

For selecting items that represent categorical data (as found in appli- 
cation forms or biographical data blanks) from a large pool of items 
such that the surviving ones will exhibit the maximum possible relation- 
ship with the criterion variable, Friedman (26) worked out an empirical 
approach that he called the “quartile difference method of item selec- 
tion.” The procedure is somewhat analogous to the Wherry-Doolittle ap- 
proach. 

Loevinger, Gleser, and DuBois (44) described and illustrated a tech- 
nic for maximizing the discrimination power of a multiple-score test. In 
essence the procedure necessitates maximizing the degree of homogeneity 
of each subtest, which initially consists of a nucleus of three items with 
high covariances inter se, and minimizing the amount of correlation be- 
tween subtests. Items are added to and dropped from each separately keyed 
subtest in order that the degree of saturation (ratio of inter-item co- 
variance to total variance) may be kept as high as possible. Eliminating 
those items that lower the amount of saturation in a subtest and adding 
those items one by one that will maximize the saturation of a subtest, 
one continues the process until all items have been either incorporated 
within or discarded from a given subtest. Whenever the amount of correlation
between subtests approximates the geometric mean of their respective
saturations, the items are combined as a new pool. In order that
correlations between subtests may be reduced as much as possible without 
loss in the saturation of the subtests, formulas are presented for aiding 
the technician in his decisions. 
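
The saturation index described above can be illustrated with a short sketch. It takes saturation to be the sum of the inter-item (off-diagonal) covariances divided by the variance of the subtest total score, which is one reading of "ratio of inter-item covariance to total variance." The function names, the use of population covariances, and the rule of adding an item only when it raises saturation are assumptions made for illustration; they are not the formulas of Loevinger, Gleser, and DuBois.

```python
from statistics import mean


def covariance(x, y):
    """Population covariance of two equally long score lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def saturation(items):
    """Ratio of summed inter-item covariances to total-score variance.

    `items` is a list of item columns, each a list of scores (for example
    0/1) with one entry per examinee.
    """
    total = 0.0   # variance of the total score = sum of all covariance cells
    inter = 0.0   # off-diagonal cells only
    for i, xi in enumerate(items):
        for j, xj in enumerate(items):
            c = covariance(xi, xj)
            total += c
            if i != j:
                inter += c
    return inter / total if total > 0 else 0.0


def try_add(subtest, candidates):
    """Pick the candidate item that raises saturation the most.

    Returns (index, new_saturation); index is None when no candidate
    improves on the current subtest (an assumed stopping rule).
    """
    best, best_val = None, saturation(subtest)
    for idx, cand in enumerate(candidates):
        val = saturation(subtest + [cand])
        if val > best_val:
            best, best_val = idx, val
    return best, best_val


a = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]   # zero covariance with `a`
index, value = try_add([a, a], [unrelated, a])
```

In this toy example the item identical to the nucleus is chosen and the unrelated item is passed over, since adding the latter would lower the saturation of the subtest.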

A factor-analytic approach to a problem not too dissimilar to the one 
investigated by Loevinger, Gleser, and DuBois was that described by 
Wherry and Winer (69). The latter two investigators proposed a theo- 
retical model for the estimation of the factor loadings of a large number 
of items, or tests, which is a special case of the multiple-group centroid 
method of factoring. The formidable task involving the intercorrelations 
of the original variables can be circumvented. Despite the presence of 
several exacting assumptions, the writers declared that the model could 
be applied to psychological factor problems in which a variety of variables 
may be found. 

For selection of items that will maximize the degree of correlation be- 
tween total test scores and criterion scores, Webster (67) proposed a 
nonparametric procedure that requires only item counts on the total 
test and criterion distributions and the number of responses of a designated 
type (correct or preferred) above the medians of the two distributions. 
In light of the apparent computational simplicity of the procedure when 
standard test-scoring equipment is available, the author implied that the 
statistical inefficiency of the procedure could be discounted. 

Defining a distortion of measurement as an error that may be correlated 
with other systematic errors, but not with the true score, Loevinger (43) 
introduced a correlated error term into the basic linear equation for repre- 
senting test scores as a function of a true factor and a random error term. 
The amount of distortion was noted in the use of such methods of item 
selection as the one in which item-criterion correlations are found and 
the one in which homogeneity-based approaches are followed. In the 
former method it was shown that items with or without distortions 
cannot be distinguished. However, higher criterion correlations may be 
realized if items without distortions are selected. In the latter approach 
it was shown that items with distortions tend to be selected over those 
not containing distortions. When the ratio of the criterion correlation 
to a measure of homogeneity (actually the square root of Kuder-Richard- 
son’s Formula 20) was taken, it was shown that for tests containing the 
same true factor, constancy in the ratio could be achieved under certain 
conditions when distortions were absent. Consequently the criterion- 
homogeneity ratio could be used to ascertain the relative degree to which 
several tests are weighted with distortions. 
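
As a rough illustration of the criterion-homogeneity ratio, the following sketch divides the correlation of total score with a criterion by the square root of Kuder-Richardson Formula 20. It assumes items scored 1 or 0 and population (not sample) variances; the function names are hypothetical and the sketch is not taken from Loevinger's paper.

```python
import numpy as np

def kr20(items):
    """Kuder-Richardson Formula 20 for a 0/1 item matrix (rows = examinees)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                  # proportion passing each item
    total_var = items.sum(axis=1).var()     # population variance of total scores
    return (k / (k - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

def criterion_homogeneity_ratio(items, criterion):
    """Criterion correlation of the total score divided by sqrt(KR-20)."""
    total = np.asarray(items, dtype=float).sum(axis=1)
    r_criterion = np.corrcoef(total, criterion)[0, 1]
    return r_criterion / np.sqrt(kr20(items))
```

Comparing the ratio across several tests of the same factor would then follow the reasoning given above: a lower ratio suggests a heavier weighting with distortions.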

Loevinger proceeded to point out certain relationships between her
findings in the area of distortion and those which appeared in the pre- 
viously cited paper by Loevinger, Gleser, and DuBois (44) and in articles 
by Brogden (9) and Guttman (31). After summarizing the method of 
constructing homogeneous subtests proposed by herself and two others,
she mentioned that, since any two tests could be assumed to measure
the same true factor if their correlation approximated the geometric mean
of the Kuder-Richardson Formula 20 values, then the criterion-homogeneity 
ratio could be used to ascertain which of two clusters of items should 
be employed as the nucleus about which a test of that factor could be 
constructed. Presumably, if the ratio is equal for the two tests, the one 
for which higher absolute values exist in the numerator and denominator 
of the ratio is superior. On the other hand, if the ratios differ, the test 
with the higher ratio is probably the better one since the existence of 
distortions would serve to increase the Kuder-Richardson Formula 20 
values and to lower the amount of criterion correlation. 

With respect to Guttman’s paper (31) regarding the estimation of 
reliability from formulas in which experimental independence was not 
assumed, Loevinger stated that Guttman interpreted the correlated error 
term as experimental dependence of part scores, including items as a 
special case. Challenging the practical value of test-retest reliability 
estimates which were implied by Guttman’s paper, Loevinger pointed 
out that advance knowledge about the nature of the experimental de- 
pendence among part scores would be required in Guttman’s approach, 
whereas in hers no information as to the degree of correlation of errors 
was assumed. She concluded that access to the less obvious sources of dis- 
tortion could be realized thru correlation of items or tests with outside 
criteria. 

To minimize the amount of distortion in the keys of personality ques- 
tionnaires, Brogden (9) formulated a rationale that made use of sup- 
pressor items in order to reduce the amount of distortion in and to en- 
hance the validity of the scoring keys. Intended primarily for application 
to the construction of forced-choice items, the procedure was tested 
empirically. An experiment with forced-choice items revealed that, when 
a key had been developed by the new procedure, a validity of .33 was 
realized, but that, when a second key had been developed by somewhat 
more conventional approaches, a significantly lower validity of .23 was 
obtained. In a footnote to her paper upon distortion, Loevinger (43)
explained Brogden’s index of distortion for items to be the biserial
correlation of the item with the difference between test score and test score as
predicted from criterion score. In analyzing the numerator of Brogden’s 
distortion index, Loevinger showed that when the criterion does not 
contain an error factor, the index will increase only as the distortion in 
the item increases. In the usual instance of a fallible criterion, the dis- 
tortion index would rise as the amount of distortion in the item increases 
and also as the weight of the true factor in the item becomes greater. 

Of considerable interest to the individual engaged in the selection of 
test items for homogeneous tests was another paper by Loevinger (42) 
that dealt with the attenuation paradox. The attenuation paradox was 
interpreted to mean essentially that in a homogeneous test devised to 
measure a single function the amount of validity of the test (its degree 
of correlation with the common factor) actually decreases as the magnitude 
of the reliability, which is reflected by the average amount of item
intercorrelations, increases beyond a certain point. The interval of reliabilities
below which an increase in reliability results in an increase in validity 
was referred to as the “classical region”—in other words, the interval 
within which the attenuation of validity lessens with an increase in reli- 
ability. The interval within which the attenuation in validity occurs is 
known as the “region of paradox.” It was pointed out that the attenuation 
paradox could be resolved thru making use of two rules of test con- 
struction. Clearly for the classical region, the closer that items can be to 
the difficulty level of .50 and consequently to equivalence, the greater both 
the reliability and the validity will be. In the region of paradox it would 
be necessary to obtain the optimal distribution of item difficulties in rela- 
tion to item intercorrelations. 

Giving systematic consideration to several studies, the contents of which 
throw some light upon the attenuation paradox in test theory, Loevinger 
evaluated this phenomenon with respect to the level and distribution of 
item difficulties and to the degree of intercorrelation of items, and suggested 
certain applications to test construction. For tests of low homogeneity 
(those in the classical region) it was recommended that to obtain a lower 
bound of the validity coefficient, the sample on which the test is to be 
standardized should be no more variable than the sample on which it will 
be used. From a consideration of the distribution and level of item difficul- 
ties and the intercorrelations of items, it was deduced for a test in the 
region of paradox that if a lower bound for the validity is to be realized, 
the variability of the standardization sample should be at least as great as 
that of any sample to which the test will subsequently be administered. 
From the standpoint of both theoretical and practical considerations it 
was concluded that, if application to only one group is planned, concen- 
tration of item difficulties should be encouraged when the intercorrela- 
tions of items are low. If tests are to be administered to groups that differ, 
somewhat greater dispersion of item difficulties can be expected when the 
item correlations are not too low. Since in practice, however, it is un- 
likely that very many items will be found at exactly the .50 level, it is 
probably a fortunate circumstance that the almost inevitable dispersion of
item difficulties takes place. From her considerations Loevinger concluded 
that a method of item selection that favors inclusion of items of median 
level of difficulty, but does not exclude “good” items at more extreme 
levels of difficulty would be indicated. 

In an empirical study Pickrel (61) evaluated the relative predictive 
efficiency of three methods of keying the items within a biographical in- 
ventory that was administered to 4461 graduates of technical schools of 
the Air Force. Two of the methods were based on an approach to the 
homogeneous keying of items. The principal difference between the two 
methods was that in one a linear multiple-regression technic was employed 
in combining scores on subtests and in the other a novel pattern technic 
was applied to predict the criterion scores. In the pattern technic two 
quantified series, (a) the number of subtest variables and (b) the number
of intervals within the distributions of scores on the variables (such
as high, middle, or low), were systematically varied in order that the 
degree of efficiency of different combinations of test variables and numbers 
of groupings of scores within each test variable might be compared. The 
third method involved the direct empirical selection of items relative to 
their degree of correlation with the criterion, but irrespective of their 
psychological meaningfulness or similarity of content. It was found that 
combining scores from homogeneous keys led to higher validity coefficients 
than those realized thru use of a heterogeneous key when the numbers of 
items for each key were comparable. The linear multiple-regression 
technic in general yielded somewhat higher validity coefficients on cross 
validation than the unique pattern approach altho the latter technic tended 
to be superior with the initial experimental groups. 

Much less theoretical in its orientation than several other papers deal- 
ing with item selection was the article by MacLean and Tait (46). They 
described a procedure intended to facilitate not only the calculation of 
various item and test parameters, but also the selection of those items 
that contribute positively to the internal-consistency reliability and the 
rejection of those items that detract from the reliability of the test. It 
occurred to the reviewer that the particular order in which an item ap- 
peared might influence to some degree the decision as to whether it 
should be retained or rejected. It would seem reasonable that there 
would be times when the relative contribution of an item to the true 
variance of the test would depend upon which items had already been 
chosen. 

With the intention of aiding both the unsophisticated users of tests and 
the statistically sophisticated test specialist in the evaluation of the effective- 
ness of items, Findley (24) introduced a new index of item discrimina- 
tion that is directly a function of the difference in the numbers of cor- 
rect responses given by individuals in each of two extreme criterion groups 
of equal size. In terms of both a logical analysis and a simple mathematical 
formulation the author demonstrated that his index furnishes an efficient 
means for evaluating the relative degree of effectiveness of the items within 
a test. Altho Ebel (19) proposed that for the guidance of classroom in- 
structors the difference between the number of correct responses in the 
upper group and the number of correct responses in the lower group 
would suffice for item evaluation, Findley showed that the difference 
between the number of correct item discriminations and the number of 
incorrect discriminations yields a number between zero and unity that is 
the same as the difference between the proportions of correct responses 
given by individuals found in the upper and lower criterion groups, 
respectively. 

If U and L respectively designate the numbers of individuals in the 
upper and lower criterion groups each of size n who are respectively right 
in their responses to an item and if n-U and n-L constitute the numbers 
who are respectively wrong in their answers to an item, then the numbers
of correct discriminations and incorrect discriminations are given,
correspondingly, by U(n-L) and L(n-U). When the difference between the
number of correct and the number of incorrect discriminations is divided by
n² (the maximum number of correct discriminations realized when all
individuals in the upper group do respond with the right answer and 
none in the lower group does) it can be seen that subsequent algebraic 
simplification will lead to the expression (U-L)/n, which is merely the 
difference between the proportions of correct responses of individuals 
in the upper and lower groups. Findley not only illustrated the use of his 
index in both the over-all evaluation of the effectiveness of a test item 
and the detection of needed revisions among various alternatives of a 
multiple-choice item, but also cited item-analysis situations in which his
index could be more meaningfully applied than could the point biserial 
coefficient. Because of its simplicity, the index should prove to be of con- 
siderable use to teachers who wish to carry out periodic item analyses 
upon their examinations with a view toward retaining certain items and 
revising others. 
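
The algebra above can be checked with a short sketch. The function name `findley_index` is hypothetical; the computation follows the counts of correct and incorrect discriminations as defined in the text and should agree with (U-L)/n.

```python
def findley_index(upper_correct, lower_correct, n):
    """Index of item discrimination from two extreme groups of size n.

    upper_correct (U) and lower_correct (L) are the numbers answering
    the item correctly in the upper and lower criterion groups.
    """
    correct = upper_correct * (n - lower_correct)     # U(n-L)
    incorrect = lower_correct * (n - upper_correct)   # L(n-U)
    return (correct - incorrect) / n**2               # simplifies to (U-L)/n
```

For example, with 40 of 50 correct in the upper group and 15 of 50 in the lower group, the index equals (40 - 15)/50.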

Two carefully conducted empirical studies of the use of sequential 
analysis in the selection of test items were noteworthy. Anastasi (2) ap- 
plied sequential-analysis procedures to 46 items of an achievement test 
in geometry administered to high-school students, and to the items of a 
nonverbal, projective personality test of 25 scorable characteristics given 
to Air Force student pilots. In addition, Pearson correlations based on use 
of the upper and lower 27-percent groups were estimated for the geometry 
test, and for the other test phi coefficients were derived from the upper 
and lower criterion groups, each of which consisted of 50 cases. Close 
agreement was found in the decisions rendered regarding the accepta- 
bility of items when the sequential-analysis and more traditional pro- 
cedures were employed. An important consideration cited in the degree of 
success to be realized from the application of sequential analysis to item 
selection was the amount of separation between the two extreme groups. 
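
The studies reviewed here do not spell out their sequential procedures, so the following is only a generic sketch of Wald's sequential probability ratio test applied to the pass rate of a single item. The hypothesized rates `p0` and `p1`, the error rates `alpha` and `beta`, and the function name are all assumptions made for illustration.

```python
import math

def sprt_bernoulli(responses, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a pass rate.

    responses: iterable of 1 (pass) / 0 (fail), taken in order of testing.
    Returns (decision, number_of_responses_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing it accepts H1 (rate = p1)
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts H0 (rate = p0)
    llr, used = 0.0, 0
    for used, passed in enumerate(responses, start=1):
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "accept_h1", used
        if llr <= lower:
            return "accept_h0", used
    return "continue", used                   # no boundary reached yet
```

The point of such a procedure is that testing stops as soon as a boundary is crossed, so widely separated hypotheses (or groups) lead to decisions after fewer cases.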

Employing a modified method of sequential analysis to ascertain 
whether items fell within a chosen range of difficulty, Burgess (10) ad- 
ministered several short tests each consisting of 20 or 25 items to classes 
of 24 to 30 students under ample time limits with the intention of de- 
veloping a 75-item test. For the total of 213 items analyzed it was found 
that the modal number of subjects tested was between 40 and 59 before 
a decision regarding the acceptance or rejection of an item was reached. 
For more than half of the items the decision regarding retention or rejec- 
tion of the item was achieved by the time 60 subjects had been tested, 
and of the remaining items approximately half were subsequently rejected. 
The author recommended that the procedure be employed in those instances
in which time for testing is limited and in which empirical validation
either is not required or by necessity is carried out in a subsequent
stage of analysis.

Computational Aids to Item Analysis


Numerous abacs, charts, graphs, tables, and computational procedures 
for the estimation of indexes of item discrimination and item difficulty 
were presented. Thru use of the suggested procedures the test technician 
probably can save much time and effort that was previously required in 
many item-analysis activities. In many instances numerous valuable 
insights may be gained as to the interrelationships of various item and 
criterion properties. 

It would appear that among the indexes of item discrimination the 
tetrachoric coefficient has received perhaps more attention recently than 
any of the others. Employing essentially a correction for coarseness of 
grouping, Perry and Michael (56) developed a formula for estimating a 
tetrachoric coefficient from a phi coefficient that is calculated from use 
of high and low (extreme) groups of a total criterion sample. It is required
that the two contrasted groups contain equal numbers of individuals.
When the proportion of cases in each of the two
extreme groups is .27, the magnitude of phi closely approximates that 
of the tetrachoric coefficient. For items near the 50-percent level of diffi- 
culty the formula yields a rapid means of obtaining a fairly accurate 
estimate of the tetrachoric coefficient from the easily calculated phi 
coefficient. A detailed tabulation of the error arising from use of the 
formula was reported in a separate paper (52). 

Probably of somewhat greater practical value to the test technician 
than the formula cited is a set of abacs, prepared by Michael, Hertzka, 
and Perry (49), that permit the estimation of a tetrachoric coefficient 
from a phi coefficient when the proportion of individuals of the total
criterion sample in each of the two contrasted groups is either .5000 or
.2743. Based on the entries furnished by Pearson’s tables of a normal 
bivariate surface, the abacs require a knowledge of the level of item diffi- 
culty and the size of the phi coefficient. 

Hsü (40) described and illustrated the use of a single nomograph for
estimation of the tetrachoric coefficient that permits various combinations 
of splits in each of the two correlated variables. It was contended that 
coefficients obtained from the nomograph, which consists of rectangular 
scales and 19 groups of curves to cover the principal dichotomy combina- 
tions, are as accurate as those determined from the familiar Thurstone 
diagrams. Altho the author has achieved a noteworthy advance, it would 
appear that the legibility of the nomograph could be somewhat improved. 
For the special case in which variables are dichotomized at the median, 
Welsh (68) described and illustrated a tabular method that yields an 
immediate estimate of the tetrachoric coefficient from knowledge of the 
proportion of cases in the plus-plus cell. 

To estimate a tetrachoric coefficient from use of the cosine-pi
approximation, Davidoff and Goheen (17) and Perry and others (59)
independently arrived at a comparable tabulation of the ratio ad/bc, where
a, b, c, and d are the frequencies in each of the quadrants formed by the fourfold
table, and reported the values of tetrachoric coefficients that correspond 
to the tabulated ratios. Slight errors in one of the tables in the paper 
by the first group of authors were acknowledged and corrected in a note 
(16). The second group of writers also constructed 12 graphs that allow 
a correction in the estimated value of the tetrachoric coefficient when 
either one or both of the correlated variables is not dichotomized at 
the median. 
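
For readers without access to the tables, the cosine-pi approximation itself is easily computed. The sketch below assumes the usual form of the approximation, cos(pi / (1 + sqrt(ad/bc))), with a and d taken as the two concordant cells of the fourfold table; the function name is hypothetical, and no correction for splits away from the median is attempted.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric coefficient.

    a and d are the concordant cells (plus-plus, minus-minus);
    b and c are the discordant cells of the fourfold table.
    """
    if b * c == 0:
        raise ValueError("b and c must be nonzero for the ratio ad/bc")
    ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(ratio)))
```

When all four cells are equal the ratio is 1 and the estimate is zero; as the ratio grows the estimate approaches unity.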

Bouvier and others (7) presented a tabulation of the error in the 
cosine-pi approximation relative to a representative selection of points 
of dichotomy in each of the two correlated variables, and from the 
pattern of error that was observed drew seven conclusions. If dichotomiza- 
tion at the median of each variable is not possible, suggestions were made 
as to how the points of dichotomy might be selected to minimize the 
amount of error in the approximation. 

One of the most significant contributions was that by Fan (21, 22) who 
constructed an item-analysis table for the estimation of a tetrachoric 
correlation coefficient when the upper and lower 27-percent groups of a 
criterion sample are employed. From knowledge of the observed proportions
of examinees who are “successful” in the two groups (p_H and p_L)
indexes of item difficulty p for the total criterion group and of item
discrimination Δ and r are read from the table. The tabular entries for
both p and Δ were derived from Pearson’s familiar tables of the normal
bivariate surface.

It is interesting to note in Fan’s descriptive paper (22) two charts
that permit a graphic estimate of p from given magnitudes of p_H and r
and a graphic estimate of both p and r for observed values of p_H and p_L.
From the second chart it was interesting to note that the average of
p_H and p_L, which is often taken as an estimate of item difficulty in validity
studies, constitutes a value closer to .50 than that furnished by p, the index
for the total sample. This fact was reported in an earlier paper by
Michael, Hertzka, and Perry (50), who suggested that for items of low
or moderate validities (r less than approximately .60) the discrepancy
between p and (p_H + p_L)/2 is negligible. Corrections in estimates of item
difficulty derived from use of extreme groups were recommended only 
when item validities are greater than .60. 

To permit a rapid estimate of the point biserial coefficient as well as
to show the relationship between the point biserial coefficient and the phi
coefficient, Michael, Perry, and Guilford (51) derived formulas applicable
to the familiar situation of item analysis in which the proportion of
individuals in each of the two extreme criterion groups is equal. Thru use of
Pearson’s tables the degree of systematic error in the formulas could be
ascertained as a function of selected levels of item difficulty and of
corresponding magnitudes of the tetrachoric coefficient. From the entries
within the tables it is possible to interrelate the magnitudes of the
tetrachoric, phi, and point biserial coefficients relative to different levels
of item difficulty when the proportions of individuals in each of the two
extreme groups are either .5000 or .2743.

Making use of one of the formulas presented by Michael, Perry, and 
Guilford in the previously cited paper, Dingman (18) developed a chart 
for estimation of a point biserial coefficient from the proportions of
individuals in the upper and lower subgroups (upper and lower halves) of
a total criterion sample who answer an item in a designated manner. 
This chart should serve to effect a substantial economy in the amount of 
time usually spent in the calculation of point biserial coefficients. 

Three item discrimination procedures concerned with the rapid estimation
of the significance of the differences in proportions or frequencies
of responses of individuals in two contrasted groups were reported. Appel 
(5) described and illustrated the use of two companion nomographs for 
testing the significance of the differences between uncorrelated percentages. 
In addition, the two nomographs can be employed to determine the size 
of a sample necessary for a given percentage difference to be considered 
significant. Kirkpatrick and Cureton (41) presented four tables, the en- 
tries in which constitute what the sums and differences in numbers of 
correct responses in two 27-percent criterion groups of four different 
sizes have to be for significance of item discrimination at the .05, .01, and 
.001 levels. The numbers of individuals in each of the four pairs of high
and low criterion groups are 25, 50, 100, and 200. Altho the writers 
recommended that each of these extreme groups should include 27 percent 
of the total experimental sample, they mentioned that each group might 
contain between 18 and 36 percent without substantial loss in the efficiency 
of the procedure. Still another useful approach is a graphic method de- 
veloped by Petersen (60) for the estimation of the significance of differ- 
ences between proportions or percentages for independent and for cor- 
related samples. It is necessary that the size of the two samples being com- 
pared be the same and that a limit of 1000 cases for the total N not be 
exceeded. 
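
Where the nomographs and tables are not at hand, the significance of a difference between two independent proportions can be approximated directly. The sketch below uses the familiar normal-approximation test with a pooled proportion; it is offered as a generic stand-in and is not the computation underlying any of the three procedures cited. The function name is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Normal-approximation test for two independent proportions.

    x1, x2: numbers of designated responses; n1, n2: group sizes.
    Returns (z, two_sided_p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail area
    return z, p_value
```

With 40 of 50 correct in the upper group and 20 of 50 in the lower group, for instance, the difference is significant well beyond the .001 level.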

Subsequent to deriving a formula for determination of the coefficient 
of correlation of an item with the composite of items remaining in a test, 
Guilford (29) furnished four abacs for the estimation of this coefficient
providing for different amounts of item-total correlation, different values
in the standard deviation of the total test scores, and specified levels of 
item difficulty. The abacs should serve to yield a quick means for deter- 
mining the approximate amount of correlation between an item and the 
total test that would be spurious as a consequence of the item’s consti- 
tuting a part of the total test variance. 
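
The quantity read from such abacs can also be computed directly. The sketch below assumes the standard part-whole correction for the correlation of an item with the remainder of the test, and takes the standard deviation of a dichotomous item as the square root of p(1-p); whether this is term for term the formula Guilford derived is not confirmed by the review. The function name is hypothetical.

```python
import math

def item_remainder_r(r_it, sd_total, p):
    """Correlation of a 0/1 item with the total score excluding that item.

    r_it: item-total correlation; sd_total: standard deviation of total
    scores; p: proportion passing the item.
    """
    sd_item = math.sqrt(p * (1.0 - p))
    numerator = r_it * sd_total - sd_item
    denominator = math.sqrt(
        sd_total**2 + sd_item**2 - 2.0 * r_it * sd_total * sd_item
    )
    return numerator / denominator
```

The difference between `r_it` and the returned value is the spurious component that arises because the item contributes to the total test variance.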

From the standpoint of recent developments in IBM technics two 
contributions that may be applied to item or test analysis appear to be 
noteworthy. Caffrey and Wheeler (11) described and illustrated an IBM 
procedure for computation of a chi square value as a function of cell 
frequencies when the criterion variable was dichotomized such that two 
groups of equal size resulted and when the continuum underlying the 
responses to the item was also dichotomized. Payne and Staugas (55)
proposed an IBM method for the calculation of intraserial correlations
that may prove to be of value in certain types of item-analysis or
factor-analysis studies in which the responses to items or tests may be viewed as
aspects of behavior in a time series.

For the situation in which a tabular or IBM electronic statistical
machine is available, MacLean and Tait (47) outlined several computational
short cuts that might be followed in the development or analysis of
tests. They described what they believed to be economical methods of
(a) arrangement of the data in workable form; (b) calculation of means,
variance, and Kuder-Richardson reliability; (c) determination of
indexes of item difficulty and/or mean level of item difficulty; (d)
computation of item-test correlations; and (e) calculation of an index of
selection that was developed elsewhere by the same writers (46).
Suggestions were made for economies when machine facilities are limited.


For the calculation of Kendall’s Tau-coefficient, useful in the correlation 
of test scores expressed in ranks, Bright (8) devised a card sorting 
method that should effect considerable savings in time and effort. In view 
of the recent emphasis upon nonparametric methods of statistics it seems 
likely that adaptations of such procedures to problems of item and test 
analysis will become numerous in the next few years. 
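
The coefficient that the card-sorting method yields can be sketched by direct counting of concordant and discordant pairs. The version below assumes no tied ranks (the so-called tau-a form) and is not a reproduction of Bright's procedure; the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau by counting concordant and discordant pairs (no ties)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1      # pair ordered the same way on both variables
        elif s < 0:
            discordant += 1      # pair ordered in opposite ways
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

The number of pairs grows as n(n-1)/2, which is why a mechanical short cut was of practical interest.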





Estimation of Test Reliability 


A substantial amount of diversified research on the estimation of test 
reliability appeared. Many of the formulas were developed on the basis 
of carefully formulated theoretical considerations. Several extensions of 
internal-consistency approaches to reliability were noted. Considerable 
attention was given to the influence of experimental dependence in
reliability estimation.

To meet the objections to inappropriateness of the explicit or implicit
assumption made of the experimental independence of item scores that
are summed when various internal-consistency methods of reliability
estimation are followed, Guttman (31) developed some general reliability
formulas that require no assumptions regarding experimental
independence. Referring to his basic and exact formulas as mathematical
identities or tautologies, he pointed out that they were of no immediate
practical use, a fact with which Loevinger (43) concurred in her
previously cited paper when she implied that the degree of correlation among
errors in part scores and in the items would need to be known in
advance. By modifying his tautologies, Guttman arrived at four practical
formulas that furnish lower bounds for the reliability coefficient from
information given by only a single administration of a test. In terms of
certain assumptions about serial dependence of test items, the manner in
which the third formula could be applied to the estimation of a lower
bound of the reliability of speeded tests was outlined.

Guttman (30) also developed new formulas to furnish lower bounds to
the reliability of a test even tho not all examinees may try all items. In
other words the formulas (in which, incidentally, it was assumed that
no negative scores were assigned to items) could be applied to either
speed or power tests. It was shown that for completed tests the formulas 
yield the same values as certain standard ones. The estimates given by the 
formulas may be used with respect to both retest and parallel-form types 
of reliability. 

Working within the framework of parallel tests, Angoff (4) developed
formulas for the estimation of the reliability of a speeded test t from its
correlation with a shorter parallel test i which is separately timed. In
addition, reliabilities of power tests may be estimated by correlating a
subset of items (test j) within the total test t or by correlating two
complementary parallel parts h and j of the total test. A measure of the
functional or effective test length that could be incorporated within the
reliability coefficient was derived relative to data yielded by a test instead
of in terms of arbitrary ratios of either numbers of items in the two tests
or respective time lengths.

Horst (32, 33, 36) wrote three important articles that are concerned in 
one way or another with Kuder-Richardson reliability formulas. In one 
paper (36) he demonstrated the relationships among several of the Kuder- 
Richardson formulas. It was shown, for example, that the Kuder-Richard- 
son Formula 20 could never be larger than Formula 14, altho the two 
might be approximately equal despite the presence of a substantial range 
of item difficulties. Of considerable practical value to the estimation of 
Kuder-Richardson reliability was the development, as well as an illustra- 
tion, of a new formula that takes into account the dispersion of item 
difficulties (32). All other assumptions pertinent to Kuder-Richardson 
estimates should be fulfilled. The third significant contribution was the 
derivation of a formula for the estimation of the reliability of an un- 
speeded test relative to a minimum lapse of time between the first and 
second administration (33). The formula would be expected to yield a 
higher estimate than the Kuder-Richardson Formula 20, since the pres- 
ence of a dispersion in item difficulties or in item heterogeneity does not 
seem to attenuate its value. 

Three other contributions concerning internal-consistency estimates of 
reliability should be noted. Coffman (12) extended Hoyt’s analysis-of- 
variance procedure, in which items were scored 1 or 0, to the situation 
in which weights of 2, 1, and 0 were assigned to item responses. Hoyt 
and Stunkard (39) extended Hoyt’s original analysis-of-variance tech- 
nic for estimation of test reliability to the situation in which any number 
of differential weights may be assigned to item responses. Discussing the 
occurrence of negative reliabilities in an internal-consistency approach, 
Cronbach and Hartmann (15) pointed out that negative indexes can occur 
by chance when the true inter-item covariance is zero. Two recommenda- 
tions were made. For tests with negative half-test correlation coefficients, 
split-half correction formulas should not be applied. In the instance of
Kuder-Richardson formulas a negative sum of item covariances should 
not be used. Altho a test with negative internal consistency could represent 
measures of opposing factors, it would be of doubtful utility as a measure 
of a psychological trait. 
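
A minimal sketch of the analysis-of-variance estimate may make the extension to weighted scoring concrete. It assumes the usual persons-by-items layout, in which reliability is estimated as 1 minus the ratio of the residual mean square to the mean square among persons. Nothing in the arithmetic requires scores of 1 or 0, so the same function accepts weights of 2, 1, and 0 or any others. The function name and both data sets are hypothetical, and the code is not taken from (12) or (39).

```python
def hoyt_reliability(scores):
    """Analysis-of-variance reliability for a persons x items score matrix."""
    n = len(scores)          # examinees
    k = len(scores[0])       # items
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    if ms_persons == 0:
        raise ValueError("examinees do not differ; the estimate is undefined")
    return 1 - ms_residual / ms_persons


# Hypothetical items scored 1 or 0.
dichotomous = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical items scored 2, 1, or 0 that happen to work against each other.
weighted = [
    [2, 0],
    [0, 2],
    [1, 0],
    [0, 1],
]
print(round(hoyt_reliability(dichotomous), 3))
print(round(hoyt_reliability(weighted), 3))
```

The first matrix yields .80. The second yields a negative coefficient, the kind of result that, by the recommendations above, should not be used.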

Two other articles concerned with somewhat different problems of reli- 
ability constitute noteworthy additions to the literature. Rummel (63) 
derived a formula that expresses a direct relationship between the reli- 
ability of differences in two test scores and the proportion of differences 
in excess of chance. From the formula a table was developed that yields 
for steps of .02 in the reliability of the differences between the scores, 
the proportion of differences in scores that cannot be accounted for by 
chance. Noble (53) demonstrated analytically and verified empirically 
that scale reliability as a function of the number of judges, n, who are 
evaluating by the method of single stimuli an appropriate attribute 
consisting of V objects or events can be determined from application of the 
Spearman-Brown formula, provided, of course, that certain assumptions 
can be fulfilled. 
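
The Spearman-Brown relation itself is brief enough to state as code. The sketch below assumes the general form of the formula, in which the reliability of a single judge is stepped up to that of n pooled judges; the function name and the numerical values are hypothetical.

```python
def spearman_brown(single_reliability, n):
    """Reliability of the pooled ratings of n judges (or of a test n times as long)."""
    return n * single_reliability / (1 + (n - 1) * single_reliability)


# Hypothetical example: one judge has reliability .20; four judges are pooled.
print(spearman_brown(0.20, 4))
```

With a single-judge reliability of .20, pooling four judges gives a scale reliability of .50.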


Evaluation of Sampling Errors in Item 
and Test Analysis 


In test construction and evaluation the stability of both item and test 
statistics is obviously an important consideration. Altho only a limited 
amount of research appeared recently, some significant contributions re- 
sulted. 

Despite the fact that test theorists for a long time have been aware 
of the presence of sampling fluctuations as a consequence of the sam- 
pling of test items as well as of the sampling of the examinees, the 
first fundamental research upon the problem appears to be the 
penetrating formulation by Lord (45). For conceptual convenience Lord 
defined two types of samples. Type-1 sampling occurs when a test is 
administered to a large number of separate samples of examinees from a 
common population. The standard deviation of the test statistic relative 
to a very large number of samples is the familiar standard error of the 
test statistic. On the other hand, if numerous forms of the same test, each of 
which consists of a random sample of items selected from a common 
universe of items, were given to the same group of examinees in such a 
way that practice and fatigue effects could be controlled, the standard 
deviation of a given test statistic calculated individually for each form 
would be the standard error of the test statistic in a type-2 sampling 
situation. In this second situation the tests were referred to as randomly 
parallel forms or randomly parallel tests. After showing that the Kuder- 
Richardson reliability coefficients are a measure of the size of type-2 
sampling errors and that the standard error of measurement of an individ- 
ual examinee’s score could be simply determined from the number of items 
in a test and the observed score of an examinee, Lord proceeded to 
demonstrate that the type-2 sampling distribution of most test statistics 
should be approximately normal, providing the sample is sufficiently large. 
For type-2 sampling procedures, the standard errors of various test 
statistics were derived. The limited consideration given to the simultaneous 
and independent sampling of examinees and items indicated that the 
sampling variance of the mean and of the variance of test scores should be 
approximately equal to the sum of the type-1 and type-2 sampling variances 
of the respective statistics. 
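
Two of these results lend themselves to a short sketch. The first function uses the form in which Lord's standard error of measurement is usually quoted, the square root of x(n - x)/(n - 1) for an observed score x on n items; that this is the exact expression in (45) is an assumption here. The second simply adds the two sampling variances, as the approximate result for simultaneous sampling suggests. Function names and values are hypothetical.

```python
from math import sqrt


def lord_standard_error(observed_score, n_items):
    """Standard error of measurement from score and test length alone.

    Assumed form: sqrt(x * (n - x) / (n - 1)), with x the number-right score.
    """
    if n_items < 2:
        raise ValueError("at least two items are required")
    return sqrt(observed_score * (n_items - observed_score) / (n_items - 1))


def combined_sampling_variance(type1_variance, type2_variance):
    """Approximate variance when examinees and items are both sampled."""
    return type1_variance + type2_variance


# Hypothetical example: a score of 40 on a 50-item test.
print(round(lord_standard_error(40, 50), 3))
```

Under this form the standard error is largest for scores near the middle of the range and falls to zero at a score of zero or a perfect score.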

The reliability of the point biserial coefficient was considered in two 
papers by Perry and Michael (57, 58). Subsequent to an elaboration upon 
Lev’s results concerning the reliability of a point biserial procedure, they 
described how the noncentral t tables prepared by Johnson and Welch 
may be employed in the determination of the fiducial limits of the point 
biserial coefficient. In the second article, they presented a tabulation of the 
5-percent and 1-percent fiducial limits. 
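
For readers who wish to compute the coefficient whose limits are tabled, a minimal sketch follows. It uses the familiar formula based on the means of the passing and failing groups and checks it against the product-moment correlation, to which it is algebraically identical. The determination of fiducial limits from the noncentral t tables is not attempted. The function names and data are hypothetical.

```python
from math import sqrt


def point_biserial(item, total):
    """Point biserial r between a 1/0 item and a continuous score."""
    n = len(item)
    n_pass = sum(item)
    p = n_pass / n
    q = 1 - p
    mean_pass = sum(t for i, t in zip(item, total) if i == 1) / n_pass
    mean_fail = sum(t for i, t in zip(item, total) if i == 0) / (n - n_pass)
    mean_all = sum(total) / n
    sd = sqrt(sum((t - mean_all) ** 2 for t in total) / n)  # population SD
    return (mean_pass - mean_fail) / sd * sqrt(p * q)


def pearson(x, y):
    """Product-moment correlation, used here only as a check."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: five examinees, one item, and their total scores.
item = [1, 1, 1, 0, 0]
total = [9, 7, 8, 4, 2]
print(round(point_biserial(item, total), 3))  # about 0.939
```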

In attempting to ascertain the influence of both the size of criterion 
groups and the level of significance employed in the selection of discrim- 
inating items upon the extent of shrinkage in test validity occurring in 
cross validation, Feldman (23) undertook an empirical study in which he 
used the .05, .02, and .01 levels of significance for item selection on two 
different-sized criterion groups. As one would expect, he concluded that 
even when the critical ratio indexes are quite high, cross validation is 
required and that with the use of small criterion groups marked shrinkage 
in test validity would be likely to occur. 

Questioning the somewhat narrow range of significance levels employed 
by Feldman in the selection of items, Appel and Kipnis (6) undertook a 
well-designed experiment in which items significant at the 50-, 20-, 5-, 
and 1-percent levels of confidence were selected from each of three 
criterion groups of size 80, 150, and 300 and were cross validated on nine 
independent groups each of size 60. Altho the results of the study indicated 
that there was no one optimal level of confidence at which items should be 
selected, it was noteworthy that scoring keys based on the acceptance of 
the 50-percent level yielded higher validities than those consisting of items 
that discriminated at the 1-percent level. In the two larger initial groups 
upon which items were validated, the highest validities tended to occur 
on cross validation for keys derived from use of the 20-percent level. 
There appeared to be the suggestion of a trend; namely, that with 
smaller criterion groups a more lenient (less rigorous) level of significance 
might be employed than when criterion groups were larger. 
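
The shrinkage at issue in both studies can be illustrated by a small simulation. The sketch below is not the design of either Feldman or Appel and Kipnis; it assumes items that have no true validity whatever, so that any validity found in the derivation group is due wholly to capitalization on chance. Items are drawn as continuous normal scores for simplicity, and the function names, sample sizes, cutoff, and seed are all hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_shrinkage(n_derivation=80, n_cross=60, n_items=100,
                       cutoff=0.15, seed=0):
    """Select chance-valid items in one group and cross validate the key."""
    rng = random.Random(seed)

    def draw_group(n):
        criterion = [rng.gauss(0, 1) for _ in range(n)]
        items = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_items)]
        return criterion, items

    # Item selection: keep items whose correlation with the criterion
    # reaches the cutoff, keyed in the direction observed.
    criterion_d, items_d = draw_group(n_derivation)
    key = {}
    for j, item in enumerate(items_d):
        r = pearson(item, criterion_d)
        if abs(r) >= cutoff:
            key[j] = 1 if r > 0 else -1

    def keyed_scores(items, n):
        return [sum(sign * items[j][i] for j, sign in key.items())
                for i in range(n)]

    validity_derivation = pearson(keyed_scores(items_d, n_derivation),
                                  criterion_d)
    criterion_c, items_c = draw_group(n_cross)
    validity_cross = pearson(keyed_scores(items_c, n_cross), criterion_c)
    return len(key), validity_derivation, validity_cross


n_selected, validity_d, validity_c = simulate_shrinkage()
print(n_selected, round(validity_d, 2), round(validity_c, 2))
```

Because the items are pure noise, the validity in the derivation group is spuriously high, and the cross-validation value should fall near zero. Raising the cutoff or enlarging the derivation group changes how many chance items enter the key.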


Prediction Technics 


During this period there appeared five or six articles, the contents of 
which represented aspects of statistical methodology that are closely 
related to the evaluation of tests. In particular any significant additions 
to, or modifications of, multiple-regression technics in a test-theory setting 
might be viewed as relevant contributions of statistical methodology to 
the understanding of the effectiveness of a test in relation to others in- 
corporated within a battery. 

Michael and Perry (48) derived and illustrated the use of an equation 
for estimating the critical score or scores in an independent variable 
(test) that would be necessary to assure inclusion of an individual in a 
designated category of a trichotomous dependent variable (criterion) at 
a given level of probability. Thus, for a criterion in which there could be 
three levels of performance, such as superior, satisfactory, or unsatisfac- 
tory, it is possible to determine the score on a test that would assure at a 
certain probability level an examinee’s placement in one of the three cate- 
gories. In the instance of prediction of membership in the middle category, 
two cutting scores were found to occur on several occasions. 

Introducing the concept of moderator variable as an auxiliary device 
to prediction, Saunders (65) developed a new type of regression equation 
consisting of additional terms involving products of the ordinary inde- 
pendent variables that make up the usual linear composite. These product 
terms are intricately related to the properties of the moderator variable. 
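
A regression equation with a product term can be sketched as follows. The sketch assumes the simplest case, one predictor x, one moderator z, and their product, fitted by ordinary least squares through the normal equations. It is offered as an illustration of the general idea and is not Saunders's own formulation. All names and data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_moderated_regression(x, z, y):
    """Least-squares weights for y = b0 + b1*x + b2*z + b3*x*z."""
    rows = [[1.0, xi, zi, xi * zi] for xi, zi in zip(x, z)]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(4)] for a in range(4)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(4)]
    return solve(xtx, xty)


# Hypothetical, error-free data generated from known weights.
x = [float(xi) for xi in range(4) for _ in range(3)]
z = [float(zi) for _ in range(4) for zi in range(3)]
y = [1.0 + 2.0 * xi + 0.5 * zi + 1.5 * xi * zi for xi, zi in zip(x, z)]
b0, b1, b2, b3 = fit_moderated_regression(x, z, y)
print(round(b0, 3), round(b1, 3), round(b2, 3), round(b3, 3))
```

In such an equation the slope of the criterion on x is b1 + b3*z, so that the predictive weight of x varies with the level of the moderator.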

Perhaps of more theoretical than practical importance was Creager’s 
contribution (13) that dealt not only with the interrelationships among 
the statistics that describe linear composites, multiple regression, and 
factor analysis, but also with the estimation of certain unknown correla- 
tions such as those found among several criterion variables. The multiple- 
regression approach was central to the development of certain concepts in 
both factor theory and linear composites. 

Perhaps the most fundamental research entailing the application of 
multiple-regression technics to the prediction of criterion measures is 
to be found in two monographs by Horst (37, 38). In the two monographs 
the problems of the development of a differential prediction battery and 
of a multiple prediction battery were respectively considered. For differ- 
ential prediction, the problem posed was to select a specified number of 
predictors from several available ones that would yield simultaneously 
the most nearly accurate prediction of differences between scores on all 
possible pairs of criterion variables within a given set. For multiple 
absolute prediction, an attempt was made to select a given number of 
predictors such that the degree of accuracy with which all of the criterion 
variables are predicted will be at a maximum irrespective of the extent to 
which the chosen battery differentiates among the various criterion 
measures. 

In selecting a suitable index of differential prediction efficiency, Horst 
defined his measure relative to the variance of the predicted difference 
scores—the greater the variance, the higher the degree of differential 
prediction efficiency. In mathematical terms an equivalent definition was 
realized thru maximizing the amount of difference between the average 
variance and the average covariance of the predicted scores on the criterion 
variables. Since the computational efforts required for the exact least-squares 
solution to the problem would constitute a prohibitive task, an 
iterative procedure was developed for successively choosing predictors, 
each one of which in combination with previously selected ones will 
furnish the maximum index of differential prediction. Cessation of the 
process takes place when an arbitrary number of predictors has been 
chosen. Accompanying a detailed explanation of the computational pro- 
cedure is a numerical example. 
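
The equivalence of the two definitions of the index can be checked numerically. The sketch below assumes a matrix of predicted scores with one row per person and one column per criterion. It computes the average variance minus the average covariance, and separately the sum of the variances of all pairwise predicted difference scores; for m criteria the latter equals m(m - 1) times the former. The iterative selection of predictors is not reproduced. Function names and data are hypothetical.

```python
def covariance(a, b):
    """Population covariance of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n


def differential_index(predicted):
    """Average variance minus average covariance of predicted criterion scores."""
    cols = list(zip(*predicted))
    m = len(cols)
    avg_var = sum(covariance(c, c) for c in cols) / m
    avg_cov = sum(covariance(cols[i], cols[j])
                  for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    return avg_var - avg_cov


def sum_of_difference_variances(predicted):
    """Sum, over all pairs of criteria, of the variance of predicted differences."""
    cols = list(zip(*predicted))
    m = len(cols)
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            diff = [a - b for a, b in zip(cols[i], cols[j])]
            total += covariance(diff, diff)
    return total


# Hypothetical predicted scores: four persons, three criteria.
predicted = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 1.0],
    [3.0, 1.0, 2.0],
    [4.0, 3.0, 0.0],
]
m = len(predicted[0])
print(differential_index(predicted) * m * (m - 1))
print(sum_of_difference_variances(predicted))
```

The two printed values agree, which is why maximizing the one is equivalent to maximizing the other.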

In the instance of multiple absolute prediction, the sum of the variances 
of the predicted measures on the criterion variables, disregarding their 
average covariances, constitutes the measure of prediction of the chosen 
battery. When unweighted, or standard, criterion measures are employed, 
it can be shown that the sum of the variances is equal to the sum of the 
squares of the multiple correlations of each of the criteria with the pre- 
dictors. (Incidentally, the procedure can be readily extended to the situa- 
tion in which the criterion variables are assigned different weights rela- 
tive to their degree of judged importance.) As in the case of differential 
prediction, the weights associated with the predictors are those based on 
the familiar least-squares regression principle. A simplified iterative 
technic is described and illustrated for the same numerical problem con- 
sidered for the differential prediction procedure. In general one may not 
expect to find the same set of predictors chosen for the two types of pre- 
diction problems. 


Transformation of Scale Values 


Two important papers appeared that were concerned with the trans- 
formation of scores. Engelhart and Thomas (20) developed several equa- 
tions for transforming scores on the scale of a subgroup of a population 
to the scale of the entire population when two types of measures (instruc- 
tional scores and examination scores) are available for the subgroup and 
the total group. These equations should find a great deal of application 
in citywide and statewide testing programs as well as in certain college 
and university courses for which there are many sections. 

Anderson, Gray, and Kullstedt (3) presented tables for transforming 
orders of merit, or ranks, into normalized scores without the necessity of 
first computing the percent positions of the individuals of the group. 
These tables should be of considerable help to individuals engaged in the 
selection and supervision of personnel in industry and civil service as well 
as to the admissions officials in colleges and universities who find occasion 
to convert the rank of a high-school student in his class to a normalized 
score. 
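
The computation that such tables spare the user can be sketched directly. The sketch assumes the conventional route: a rank is converted to a percent position, taken here as (rank - .5)/N with rank 1 the highest, and the percent position is then converted to a normal deviate. The scale with mean 50 and standard deviation 10 is an assumption for illustration, as is the function name; the tabled values in (3) may rest on somewhat different conventions.

```python
from statistics import NormalDist


def rank_to_normalized_score(rank, n, mean=50.0, sd=10.0):
    """Convert a rank (1 = highest) in a group of n to a normalized score."""
    if not 1 <= rank <= n:
        raise ValueError("rank must lie between 1 and n")
    proportion_above = (rank - 0.5) / n           # percent position from the top
    z = NormalDist().inv_cdf(1 - proportion_above)
    return mean + sd * z


# Hypothetical class of five students.
for rank in range(1, 6):
    print(rank, round(rank_to_normalized_score(rank, 5), 1))
```

The middle rank in a group of five receives a score of 50, and the scores for the first and last ranks lie equally far above and below it.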


Profile and Pattern Analysis 


Altho some consideration is given in Chapter IV to the area of profile 
and pattern analysis, it would seem desirable, perhaps for the sake of 
completeness, to cite certain papers that were concerned with one or more 
statistical aspects. In the assessment of similarity between profiles, Cronbach 
and Gleser (14) cited general methodological difficulties, introduced a 
statistical model to describe the concept of similarity between persons 
that furnishes a basis for systematic consideration of the various assump- 
tions pertaining to the better known measures of profile similarity, and 
stressed the fact that different treatments lead to different conclusions. 
They suggested that the most satisfactory model might be to consider 
tests as coordinates, to conceive of a given person’s set of scores to be 
a point in the test space, and to compute a D measure that is a function 
of the distance between points as an index of the similarity between sets 
of scores. Interested also in a mathematical approach, Tiedeman (66) pro- 
posed a geometric model relative to four profile problems and discussed 
the underlying assumptions. 
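
The D measure described above amounts to the ordinary distance between two points in the test space. A minimal sketch follows; the function name and the two profiles are hypothetical, and questions of scaling and weighting the tests, which the authors treat at length, are left aside.

```python
from math import sqrt


def profile_distance(profile_a, profile_b):
    """D: the distance between two persons' sets of scores on the same tests."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical profiles on three tests.
print(profile_distance([10, 14, 7], [13, 10, 7]))  # 5.0
```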

Somewhat less mathematical in its tenor was the comprehensive paper 
by Gaier and Lee (27) in which three aspects of the problem of pattern 
analysis were considered: pattern representation, pattern matching, and 
pattern prediction. 

Achieving an important analytic triumph, Horst (35) proposed a mathe- 
matical formula that expresses the exact relationship between what is de- 
fined as the “configural” validity of two test items and the inter-item phi 
coefficients for each of two subgroups of a dichotomous criterion. Em- 
ploying two examples from studies by two different investigators, Horst 
illustrated the use of his formula in the determination of a configural score. 
It was suggested that in the analysis of configurations of responses the 
use of nonlinear models might be more fruitful than the conventional linear 
equations. Another important analytic advance was that of Sakoda (64) 
who demonstrated a relationship between the D-score (a measure of 
pattern similarity) proposed by Osgood and Suci (54) and scores ob- 
tained from application of the Q-technique of factor analysis. He was able 
to show that it was not necessary to equalize the means and variances of 
the scores for each individual. 
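
The elementary connection between a distance measure and a correlation between persons can be verified numerically. The sketch below covers only the special case in which each person's scores have been standardized across the k tests, where the squared distance equals 2k(1 - r); it does not reproduce Sakoda's more general result, which dispenses with that equalization. Function names and profiles are hypothetical.

```python
from math import sqrt


def standardize(profile):
    """Give one person's scores a mean of 0 and a population SD of 1."""
    k = len(profile)
    mean = sum(profile) / k
    sd = sqrt(sum((x - mean) ** 2 for x in profile) / k)
    return [(x - mean) / sd for x in profile]


def squared_distance(profile_a, profile_b):
    """Squared distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))


def correlation_between_persons(profile_a, profile_b):
    """Correlation of two persons' scores across tests."""
    za, zb = standardize(profile_a), standardize(profile_b)
    return sum(a * b for a, b in zip(za, zb)) / len(za)


# Hypothetical profiles on five tests.
person_a = [12.0, 15.0, 9.0, 20.0, 14.0]
person_b = [30.0, 41.0, 35.0, 44.0, 28.0]
k = len(person_a)
r = correlation_between_persons(person_a, person_b)
print(squared_distance(standardize(person_a), standardize(person_b)))
print(2 * k * (1 - r))
```

The two printed values agree, so that for standardized profiles ranking pairs of persons by distance and by correlation gives the same order, reversed.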

Seemingly related to the concept of pattern analysis was the empirical 
study carried out by Fowler (25) who interrelated Ferguson’s indexes of 
item conformity and person conformity. The responses of 100 pupils in 
Grade VIII to a vocabulary test of 74 items were studied in terms of 
Ferguson’s measures. Altho the study was carefully done it was not possi- 
ble, as one might expect, to draw conclusions as to the probable reasons 
that would explain why a pupil with a relatively high total score misses 
certain easy items and another pupil with a relatively low total score an- 
swers several difficult items correctly. 





Other Contributions 


Two statistical contributions of considerable importance in the area 
of test construction and evaluation that are not readily classified with 
respect to the chosen categories of presentation were the papers by Horst 
(34) and Plumlee (62). Plumlee first derived a formula for the estimation 
of the validity of a multiple-choice examination in relation to the 
number of options, the number of items in the test, the mean of the test, 
and its variance, and then showed that the theoretical estimate of the 
validity of the multiple-choice test is less than that obtained when the test 
is in an answer-only form. She then carried out an empirical study with 
mathematics test items that were prepared with answer options (multiple- 
choice form) and without answer options (answer-only form). The 
results of the study indicated that the multiple-choice form was as predic- 
tive of success in one course as the answer-only form. It was pointed out 
that the assumption most likely to be violated is that chance alone deter- 
mines the selection of options when the examinee does not know the 
answer to an item. Horst derived a formula that yields the maximum 
theoretical correlation between two tests in multiple-choice form in terms 
of the distribution of the correct responses to items in the two tests and 
the probability of chance success. In the numerical example presented, it 
was interesting to note that a reduction in validity from .744 to only 
.693 occurred when a chance factor of .20, associated with five options, 
was introduced. 


Bibliography 


1. AMERICAN EDUCATIONAL RESEARCH ASSOCIATION. “Statistical Methodology in 
Educational Research.” Review of Educational Research 24: 353-536; December 
1954. 

2. ANASTASI, ANNE. “An Empirical Study of the Applicability of Sequential Analysis 
to Item Selection.” Educational and Psychological Measurement 13: 3-13; 
Spring 1953. 

3. ANDERSON, KENNETH E.; GRAY, ROBERT T.; and KULLSTEDT, EINAR V. “Tables 
for Transmutation of Orders of Merit into Units of Amounts or Scores.” 
Journal of Experimental Education 22: 247-55; March 1954. 

4. ANGOFF, WILLIAM H. “Test Reliability and Effective Test Length.” Psychometrika 
18: 1-14; March 1953. 

5. APPEL, VALENTINE. “Companion Nomographs for Testing the Significance of the 
Difference Between Uncorrelated Percentages.” Psychometrika 17: 325-30; Sep- 
tember 1952. 

6. APPEL, VALENTINE, and KIPNIS, DAVID. “The Use of Levels of Confidence in 
Item Analysis.” Journal of Applied Psychology 38: 256-59; August 1954. 

7. BOUVIER, EUGENE A., and OTHERS. “A Study of the Error in the Cosine-Pi 
Approximation to the Tetrachoric Coefficient of Correlation.” Educational and 
Psychological Measurement 14: 690-99; Winter 1954. 

8. BRIGHT, HAROLD F. “A Method for Computing the Kendall Tau-Coefficient.” 
Educational and Psychological Measurement 14: 700-704; Winter 1954. 

9. BROGDEN, HUBERT E. “A Rationale for Minimizing Distortion in Personality 
Questionnaire Keys.” Psychometrika 19: 141-48; June 1954. 

10. BURGESS, GEORGE G. “Use of Sequential Analysis for Determining Test Item 
Difficulty Level.” Educational and Psychological Measurement 15: 80-86; Spring 
1955. 

11. CAFFREY, JOHN, and WHEELER, FRED. “A Simplified χ² Formula for Rapid 
Computation of Certain Item-Analysis Data with IBM Punched-Card Equipment.” 
Journal of Experimental Education 21: 265-69; March 1953. 

12. COFFMAN, WILLIAM. “Estimating the Internal Consistency of a Test When 
Items Are Scored 2, 1, or 0.” Educational and Psychological Measurement 
12: 392-93; Autumn 1952. 




February 1956 STATISTICAL METHODS 





13. CREAGER, JOHN A. Some Relations among Linear Composites, Multiple Regres- 
sion and Factor Analysis Useful in Estimating Unknown Correlation. U. S. Air 
Force Personnel and Training Research Center, Research Bulletin 54-107. 
Lackland Air Force Base, Texas: Personnel and Training Research Center, 1954. 
20 p. 

14. CRONBACH, LEE J., and GLESER, GOLDINE C. “Assessing Similarity Between 
Profiles.” Psychological Bulletin 50: 456-73; November 1953. 

15. CRONBACH, LEE J., and HARTMANN, WALTER. “A Note on Negative Reliabilities.” 
Educational and Psychological Measurement 14: 342-46; Summer 1954. 

16. DAVIDOFF, MELVIN D. “Note on ‘A Table for the Rapid Determination of the 
Tetrachoric Correlation Coefficient.’” Psychometrika 19: 163-64; June 1954. 

17. DAVIDOFF, MELVIN D., and GOHEEN, HOWARD W. “A Table for the Rapid 
Determination of the Tetrachoric Correlation Coefficient.” Psychometrika 18: 
115-21; June 1953. 

18. DINGMAN, HARVEY F. “A Computing Chart for the Point Biserial Correlation 
Coefficient.” Psychometrika 19: 257-59; September 1954. 

19. EBEL, ROBERT L. “Procedures for the Analysis of Classroom Tests.” Educational 
and Psychological Measurement 14: 352-64; Summer 1954. 

20. ENGELHART, MAX D., and THOMAS, MACKLIN. “A Procedure for Transforming 
Scores Unique to Part of a Student Population to the Scale of a Common 
Examination Taken by the Entire Student Population.” Educational and 
Psychological Measurement 13: 248-63; Summer 1953. 

21. FAN, CHUNG-TEH. Item Analysis Table. Princeton, N. J.: Educational Testing 
Service, 1952. 32 p. 

22. FAN, CHUNG-TEH. “Note on Construction of an Item Analysis Table for the 
High-Low-27-Per-Cent Group Method.” Psychometrika 19: 231-37; September 1954. 

23. FELDMAN, MARVIN J. “The Effects of the Size of Criterion Groups and the Level 
of Significance in Selecting Test Items on the Validity of Tests.” Educational 
and Psychological Measurement 13: 273-79; Summer 1953. 

24. FINDLEY, WARREN G. “A Rationale for Evaluation of Item Discrimination 
Statistics.” (Abstract) American Psychologist 9: 365-66; August 1954. 

25. FOWLER, H. M. “An Application of the Ferguson Method of Computing Item 
Conformity and Person Conformity.” Journal of Experimental Education 22: 
237-45; March 1954. 

26. FRIEDMAN, NORMAN. “The Quartile Difference Method of Item Selection.” Journal 
of Applied Psychology 37: 356-60; October 1953. 

27. GAIER, EUGENE L., and LEE, MARILYN C. “Pattern Analysis: The Configural 
Approach to Predictive Measurement.” Psychological Bulletin 50: 140-48; March 
1953. 

28. GREEN, BERT F. “A Note on Item Selection for Maximum Validity.” Educational 
and Psychological Measurement 14: 161-64; Spring 1954. 

29. GUILFORD, J. P. “The Correlation of an Item with a Composite of the Remaining 
Items in a Test.” Educational and Psychological Measurement 13: 87-93; Spring 
1953. 

30. GUTTMAN, LOUIS. “Reliability Formulas for Noncompleted or Speeded Tests.” 
Psychometrika 20: 113-24; June 1955. 

31. GUTTMAN, LOUIS. “Reliability Formulas That Do Not Assume Experimental 
Independence.” Psychometrika 18: 225-39; December 1953. 

32. HORST, PAUL. “Correcting the Kuder-Richardson Reliability for Dispersion of 
Item Difficulties.” Psychological Bulletin 50: 371-74; September 1953. 

33. HORST, PAUL. “The Estimation of Immediate Retest Reliability.” Educational and 
Psychological Measurement 14: 705-708; Winter 1954. 

34. HORST, PAUL. “The Maximum Expected Correlation Between Two 
Multiple-Choice Tests.” Psychometrika 19: 291-96; December 1954. 

35. HORST, PAUL. “Pattern Analysis and Configural Scoring.” Journal of Clinical 
Psychology 10: 1-11; January 1954. 

36. HORST, PAUL. “Relationships Between Several Kuder-Richardson Reliability 
Formulas.” Educational and Psychological Measurement 13: 497-504; Autumn 1953. 

37. HORST, PAUL. A Technique for the Development of a Differential Prediction 
Battery. Psychological Monographs, No. 380. Washington, D. C.: American 
Psychological Association, 1954. 31 p. 

38. HORST, PAUL. A Technique for the Development of a Multiple Absolute 
Prediction Battery. Psychological Monographs, No. 390. Washington, D. C.: American 
Psychological Association, 1955. 22 p. 








REVIEW OF EDUCATIONAL RESEARCH Vol. XXVI, No. 1 





39. HOYT, CYRIL J., and STUNKARD, CLAYTON L. “Estimation of Test Reliability for 
Unrestricted Item Scoring Methods.” Educational and Psychological Measurement 
12: 756-58; Winter 1952. 

40. HSÜ, E. H. “Nomograph for Tetrachoric r.” Educational and Psychological 
Measurement 13: 339-46; Summer 1953. 

41. KIRKPATRICK, JAMES J., and CURETON, EDWARD E. “Simplified Tables for Item 
Analysis.” Educational and Psychological Measurement 14: 709-14; Winter 1954. 

42. LOEVINGER, JANE. “The Attenuation Paradox in Test Theory.” Psychological 
Bulletin 51: 493-504; September 1954. 

43. LOEVINGER, JANE. “Effect of Distortions of Measurement on Item Selection.” 
Educational and Psychological Measurement 14: 441-48; Autumn 1954. 

44. LOEVINGER, JANE; GLESER, GOLDINE C.; and DUBOIS, PHILIP H. “Maximizing 
the Discriminating Power of a Multiple-Score Test.” Psychometrika 18: 309-17; 
December 1953. 

45. LORD, FREDERIC M. “Sampling Fluctuations Resulting from the Sampling of Test 
Items.” Psychometrika 20: 1-22; March 1955. 

46. MACLEAN, ANGUS G., and TAIT, ARTHUR T. “A Procedure for Analyzing a Test 
and Maximizing Its Reliability.” Journal of Experimental Education 22: 273-78; 
March 1954. 

47. MACLEAN, ANGUS G., and TAIT, ARTHUR T. “Some Computational Short-Cuts 
in the Development or Analysis of Tests.” Journal of Applied Psychology 38: 
260-63; August 1954. 

48. MICHAEL, WILLIAM B., and PERRY, NORMAN C. “The Prediction of Membership 
in a Trichotomous Dependent Variable from Scores in a Continuous Independent 
Variable.” Educational and Psychological Measurement 12: 368-91; Autumn 
1952. 

49. MICHAEL, WILLIAM B.; HERTZKA, ALFRED F.; and PERRY, NORMAN C. “Abacs 
for the Rapid Estimation of a Tetrachoric Coefficient from a Phi Coefficient 
Calculated from Use of Contrasted Groups.” Educational and Psychological 
Measurement 13: 478-85; Autumn 1953. 

50. MICHAEL, WILLIAM B.; HERTZKA, ALFRED F.; and PERRY, NORMAN C. “Errors 
in Estimates of Item Difficulty Obtained from Use of Extreme Groups on a 
Criterion Variable.” Educational and Psychological Measurement 13: 601-606; 
Winter 1953. 

51. MICHAEL, WILLIAM B.; PERRY, NORMAN C.; and GUILFORD, J. P. “The 
Estimation of a Point Biserial Coefficient of Correlation from a Phi Coefficient.” 
British Journal of Psychology: Statistical Section 5: 139-50; November 1952. 

52. MICHAEL, WILLIAM B.; PERRY, NORMAN C.; and HERTZKA, ALFRED F. 
“Systematic Error in Estimates of Tetrachoric R.” Educational and Psychological 
Measurement 12: 515-24; Autumn 1952. 

53. NOBLE, CLYDE E. “Scale Reliability and the Spearman-Brown Equation.” 
Educational and Psychological Measurement 15: 195-205; Summer 1955. 

54. OSGOOD, CHARLES E., and SUCI, GEORGE J. “A Measure of Relation Determined 
by Both Mean Difference and Profile Information.” Psychological Bulletin 49: 
251-62; May 1952. 

55. PAYNE, M. CARR, JR., and STAUGAS, LEONARD. “An IBM Method for Computing 
Intraserial Correlations.” Psychometrika 20: 87-92; March 1955. 

56. PERRY, NORMAN C., and MICHAEL, WILLIAM B. “The Relationship of the 
Tetrachoric Correlation Coefficient to the Phi Coefficient Estimated from the Extreme 
Tails of a Normal Distribution of Criterion Scores.” Educational and 
Psychological Measurement 12: 778-86; Winter 1952. 

57. PERRY, NORMAN C., and MICHAEL, WILLIAM B. “The Reliability of a Point 
Biserial Coefficient of Correlation.” Psychometrika 19: 313-25; December 1954. 

58. PERRY, NORMAN C., and MICHAEL, WILLIAM B. “A Tabulation of the Fiducial 
Limits for the Point Biserial Correlation Coefficient.” Educational and 
Psychological Measurement 14: 715-21; Winter 1954. 

59. PERRY, NORMAN C., and OTHERS. Estimating the Tetrachoric Correlation 
Coefficient via I. a Cosine-Pi Table and II. Correction Graphs for Non-Median 
Dichotomization. Technical Memorandum No. 2. Los Angeles: University of 
Southern California, Department of Psychology, 1953. 8 p. 

60. PETERSEN, ROBERT L. “A Graphic Method for Estimating the Significance of 
Differences Between Proportions or Percentages.” Educational and Psychological 
Measurement 15: 186-94; Summer 1955. 












61. PICKREL, EVAN W. The Relative Predictive Efficiency of Three Methods of Utilizing 
Scores from Biographical Inventories. U. S. Air Force Personnel and Training 
Research Center, Research Bulletin 54-73. Lackland Air Force Base, Texas: 
Personnel and Training Research Center, 1954. 23 p. 

62. PLUMLEE, LYNNETTE. “The Predicted and Observed Effect of Chance Success 
on Multiple-Choice Test Validity.” Psychometrika 19: 65-70; March 1954. 

63. RUMMEL, J. FRANCIS. “A Simplified Method for Determining the Proportion 
of Differences in Excess of Chance Proportions Used in Differential Prediction.” 
Educational and Psychological Measurement 13: 145-49; Spring 1953. 

64. SAKODA, JAMES M. “Osgood and Suci’s Measure of Pattern Similarity and 
Q-Technique Factor Analysis.” Psychometrika 19: 253-56; September 1954. 

65. SAUNDERS, DAVID R. “The ‘Moderator Variable’ as a Useful Tool in Prediction.” 
Proceedings, 1954 Invitational Conference on Testing Problems. Princeton, N. J.: 
Educational Testing Service, 1955. p. 54-58. 

66. TIEDEMAN, DAVID V. “A Model for the Profile Problem.” Proceedings, 1953 
Invitational Conference on Testing Problems. Princeton, N. J.: Educational 
Testing Service, 1954. p. 54-75. 

67. WEBSTER, HAROLD. “Approximating Maximum Test Validity by a 
Non-Parametric Method.” Psychometrika 18: 207-12; September 1953. 

68. WELSH, GEORGE S. “A Tabular Method of Obtaining Tetrachoric r with 
Median-Cut Variables.” Psychometrika 20: 83-85; March 1955. 

69. WHERRY, ROBERT J., and WINER, BEN J. “A Method for Factoring Large 
Numbers of Items.” Psychometrika 18: 161-79; June 1953. 











Index to Volume XXVI, No. 1 


Page citations are made to single pages; these are often the beginning of a chapter, 
section, or running discussion dealing with the topic. 


Academic achievement: prediction of, 18, 
75 

Achievement tests: scoring of, 78; types 
of, 72; uses of, 74 

Adjustment: inventories of, 26 

Aptitude test batteries: studies of, 15 

Aptitude tests: prediction of academic 
achievement, 18; prediction of success 
in professional training, 17; types of, 
14; validity of, 18 

Attenuation paradox: concept of, 92 


Correlation coefficient: computational 
aids, 96 


Diagnosis: use of achievement tests in, 78 
Differential prediction: technics of, 103 

Distortion: of responses to inventories, 36 
Distortion in items: concept of, 91 


Evaluation: of item effectiveness, 94; of 
sampling errors, 101; statistical meth- 
ods for, 89; use of tests in, 5 


Grades: reliability of, 78 


Interest: inventories of, 26 

Inventories: adjustment, 26; distortion 
of responses, 36; interests, 26; new 
developments, 29; new scales for, 32; 
norms, 40; personality, 26; reliability 
of, 40; validity of, 41 

Item analysis: computational aids, 96; 
methods of, 34, 89; sampling errors 
in, 101 

Item discrimination: measures of, 94 

Item selection: for homogeneous tests, 
89; procedures for, 89 

Item types: studies of, 34 


Learning: measures of, 74 




Norms: for inventories, 40 


Pattern analysis: methods of, 38, 104 

Personality: inventories of, 26 

Prediction: of academic achievement, 18, 
75; of success in professional training, 
17; technics of, 102; use of projective 
technics in, 64 

Profile analysis: methods of, 38, 104 

Projective technics: discussion of, 56; 
norms, 63; reliability of, 58; uses of, 
64; validity of, 58 


Reliability: estimation of, 99; of grades, 
78; of inventory scores, 40; of projec- 
tive technics, 58; relation to item se- 
lection, 92 

Rorschach: studies of, 58 


Sampling errors: in item analysis, 101 

Scales: transformations of, 104 

Scoring: of achievement tests, 78 

Sequential analysis: use in item selec- 
tion, 95 

Statistical methods: in test construction, 
89 


Test construction: item selection methods, 
89; statistical methods in, 89 

Test data: interpretation of, 80 

Test results: interpretation of, 7; uses 
of, 5 

Testing: programs, 7 

Tests: methods of keying, 92; see also 
Inventories, Projective technics 

Thematic Apperception Test: studies of, 
60 


Validity: in relation to attenuation para- 
dox, 92; of aptitude tests, 18; of in- 
ventories, 41; of projective technics, 
58 







